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PREFACE 


The  U.S.  Army  Research  Office  (ARO)  sponsored  Workshop  on  ’’Future  Directions 
In  Computer  Architecture  AND_  software^  convened  on  Monday ,  5  May  1986, at  the 
Seabrook  Conference  Resort  'in  Charleston,  $C.^  LtC.  Frank  Ward,  Office  of 
Under  Secretary  of  Defense,  delivered  the  keynote  address  which  contained 
thought  provoking  examples  and  a  challenge  on  the  technological  Issues  facing 
the  computer  hardware  and  software  technical  community.  A  distinguished 
speaker  lecture  was  presented  by  Dr.  Stephen  F.  Lundstrom,  Vice-President  and 
Program  Director  of  Parallel  Processing  at  Microelectronics  and  Computer 
Technology  Corporation  (MCC). 

The  workshop  was, attended  by  95  professionals:  five  from  government  agencies, 
22  fromjir+vate  Industry,  and  the  remalner  from  various  academic  Institutions. 
Ovejs^sfxty  formal  position  papers  were  organized  in  19  serial  sessions  for  the 
ffiree  day  workshop.  The  final  afternoon  of  the  workshop  was  utilized  to 
establish  definitive  insights  Into  the  perceived  future  research  thrusts  in 
computer  hardware  and  software. 

’Topics  presented  in  the  position  papers  varied  from  VLSI  design  to  retargetabl 
software  and  from  database  to  graphics.  A  list  of  the  topics  discussed  at  the 
workshop  Includes: 


—  Instruction  Set  Considerations  > 

CustonT^fiTps^:  '* - ^ 

1  Memory Ifferarchy  and  Parallel  Architecture' 

_  ~  Interconnection  Strategics.. _ J . 

^  Reconf i^uritlon^Strategi  es^  \ 

Granularity  1  ssues  :  "■ — 

rMappTngTATc[T5r1  thms  and  Task  Assignment) 

^Reusable  and  Retargetabl e  Software, - 
distributed  Operating  Systems n 
Concurrency  Control^  . 
rJMl®  PlFSTteli  sm  and  Support, 

Di strlbuted  Computing  Systems,^ 

Architecture  and  'Software~ls.$ue$_ 

Logic  and  Functional  Programming 
VLSI  Related  Issues  • — ^ 

Applicative  Languag’e  and  Data  Flow  Techniaues  _ 

The  objectives  for  this  workshop  were  to  inform  the  research  community  of  the 
research  opportunities  offered  by  the  Army  Research  office  (ARO)  and  to  set 
general  directions  and  priorities  for  *uture  research  sponsored  by  ARO.  These 
recommended  research  thrusts  are  Included  as  part  of  these  proceedings  and 
should  prove  helpful  in  focusing  future  researchers  on  research  topics  of 
special  Interest  to  the  Army. 


C.  Ronald  Green 
Workshop  Sponsor 
Army  Research  Office 
Electronics  Division 
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C.  Ronald  Green 
Army  Research  Office 


THANK  YOU  VERY  MUCH  FOR  THE  INVITATION  TO  SPEAK  HERE  TODAY. 
IT  IS  AN  HONOR  TO  ADDRESS  A  SELECT  GROUP  SUCH  AS  THIS  ONE.  I  AM 
ESPECIALLY  PLEASED  SINCE  I  CAN  APPROACH  THE  TOPIC  OF  THIS 
CONFERENCE,  FUTURE  DIRECTIONS  IN  COMPUTER  SOFTWARE  AND 
ARCHITECTURE  FROM  TWO  PERSPECTIVES.  FIRST,  I  CAN  SEE  THE 
IMPORTANCE  OF  THE  CONFERENCE  FROM  THE  VIEWPOINT  OF  THE  DEPARTMENT 
OF  DEFENSE,  SINCE  MY  BOSS  IS  THE  DIRECTOR  OF  COMPUTER  SOFTWARE 
AND  SYSTEMS  FOR  THE  DEPARTMENT.  SECONDLY,  AND  MOST  IMPORTANT,  I 
CAN  TAKE  THE  VIEW  OF  AN  ARMY  OFFICER,  KNOWING  THAT  THE  ARMY  AND 
THE  OTHER  MILITARY  SERVICES  ARE  TOTALLY  DEPENDENT  ON  COMPUTERS  TO 
ACCOMPLISH  OUR  COMBAT  MISSION.  THERE  ARE  TWO  VERY  IMPORTANT 
OUTPUTS  EXPECTED  FROM  THIS  CONFERENCE.  THE  FIRST  IS  INTANGIBLE. 
IT  IS  THE  SHARING  OF  KNOWLEDGE  BETWEEN  THE  PARTICIPANTS  WHICH 
WILL  ADVANCE  THE  GROUP’S  LEARNING  AND  ABILITY.  THE  SECOND  IS  A 
SET  OF  GROUP  RECOMMENDATIONS  ON  RESEARCH  TO  BE  PURSUED  IN  THE 
AREAS  WE’VE  TALKED  ABOUT  IN  THE  CONFERENCE. 

THE  SCHEDULE  FOR  THE  WORKSHOP  IS  AN  AMBITIOUS  ONE.  WE  HAVE 
ELEVEN  SESSIONS  ADDRESSING  IMPORTANT  TOPICS  SUCH  AS  DISTRIBUTED 
AND  PARALLEL  HARDWARE  AND  SOFTWARE,  REUSABLE  SOFTWARE, 
INTERCONNECTION  STRATEGIES,  CUSTOM  HARDWARE,  AND  AN  ISSUE  NEAR 
AND  DEAR  TO  MY  HEART,  INSTRUCTION  SET  ARCHITECTURES. 


WE  ARE  FORTUNATE  TO  HAVE  AS  OUR  DISTINGUISHED  SPEAKER  DR. 
STEVE  LUNDSTROM,  VICE  PRESIDENT  OF  THE  MICROELECTRONICS  AND 
COMPUTING  CONSORTIUM.  HE  AND  HIS  COLLEAGUES  ARE  CONDUCTING  FIRST 
RATE  RESEARCH  IN  AREAS  IMPORTANT  TO  THIS  CONFERENCE. 

I  AM  HERE  ON  BEHALF  OF  MY  BOSS,  DR.  EDWARD  LIEBLEIN ,  WHO 
FULLY  INTENDED  TO  ACCEPT  YOUR  INVITATION,  BUT  WAS  PREEMPTED  BY  A 
SHORT  FUSED  COMMITMENT  AT  THE  LAST  MOMENT.  HE  SENDS  HIS 
REGRETS.  HE  AND  THE  DEPUTY  UNDER  SECRETARY  OF  DEFENSE  FOR 
RESEARCH  AND  ADVANCED  TECHNOLOGY,  APPRECIATE  THE  OPPORTUNITY  TO 
SHARE  OSD’S  VIEWPOINTS  WITH  YOU,  AND  THEY  UNDERSTAND  THE 
IMPORTANCE  OF  COMPUTERS  TO  OUR  MISSION  TO  PROTECT  THE  SECURITY  OF 
THE  UNITED  STATES. 

THIS  IS  A  VERY  IMPORTANT  CONFERENCE  BECAUSE  IT  ADDRESSES  THE 
FUTURE  OF  OUR  MILITARY  USE  OF  COMPUTERS.  YOU  PROBABLY  KNOW  THAT 
VIRTUALLY  EVERY  SYSTEM  WE  PLAN  TO  FIELD  IN  THE  NEXT  FEW  YEARS  IS 
CRITICALLY  DEPENDENT  ON  ONE  OR  MORE  COMPUTERS  TO  GET  THE  JOB 
DONE.  WE  WERE  WORKING  ON  SOME  REMARKS  FOR  THE  UNDER  SECRETARY  OF 
DEFENSE  FOR  RESEARCH  AND  ENGINEERING,  DR.  DON  HICKS,  AND  WE  CAST 
ABOUT  FOR  A  SYSTEM  THAT  DID  NOT  USE  A  COMPUTER.  WE  FOUND  ONE, 

AND  IT  WAS  THE  M-16  RIFLE.  OF  COURSE,  THAT  RIFLE  WAS  PROBABLY 
DESIGNED  WITH  THE  HELP  OF  A  COMPUTER,  SO  IT  MAY  NOT  BE  A  VALID 
EXAMPLE.  AS  A  MEASURE  OF  JUST  HOW  DEPENDENT  WE  ARE  ON  COMPUTERS 
IN  OUR  MILITARY  SYSTEMS,  WE  ESTIMATE  THAT  THE  THREE  SERVICES  HAVE 
185000  COMPUTERS  IN  SERVICE  RIGHT  NOW,  AND  THAT  THE  NUMBER  WILL 


...  ...  ^  A. ' 


DOUBLE  IN  THE  NEXT  FOUR  YEARS.  BY  THE  WAY,  THAT  NUMBER  DOES  NOT 


INCLUDE  MICROPROCESSORS.  THE  AIR  FORCE’S  F-15  EAGLE  FIGHTER 


AIRCRAFT  HAS  OVER  60  MICROPROCESSORS  ON  BOARD. 


IT’S  OBVIOUS  THAT  SINCE  VIRTUALLY  EVERY  DEFENSE  SYSTEM 


DEPENDS  ON  COMPUTERS  TO  GET  THE  JOB  DONE,  THEY  ALSO  DEPEND  ON 


SOFTWARE.  THERE  HAVE  BEEN  GREAT  ADVANCES  IN  COMPUTER  HARDWARE , 


BUT  THE  SOFTWARE  JUST  HAS  NOT  KEPT  UP.  OUR  SYSTEMS  ARE  PLAGUED 


WITH  PROBLEMS  THAT  ARE  ATTRIBUTED  TO  SOFTWARE. 


MUCH  OF  OUR  SOFTWARE  HAS  DEFECTS  THAT  ARE  NOT  FOUND  UNTIL 


THE  SOFTWARE  IS  FIELDED,  WHEN  THE  COST  OF  CORRECTING  THE  ERROR 


CAN  BE  100  TIMES  THAT  TO  CORRECT  AN  ERROR  FOUND  IN  THE  DESIGN  OR 


CODING  PHASE.  ONE  REASON  FOR  THIS  SITUATION  IS  THE  VERY 


COMPLEXITY  OF  THE  SOFTWARE  IN  OUR  MISSION-CRITICAL  SYSTEMS.  DoD 


SYSTEMS  USE  SOME  OF  THE  MOST  COMPLEX  SOFTWARE  EVER  ATTEMPTED.  IT 


IS  VIRTUALLY  IMPOSSIBLE  TO  RIGOROUSLY  TEST  EVERY  SECTION  OF  CODE 


BEFORE  THE  SYSTEM  GETS  FIELDED.  MOST  DOD  SOFTWARE  SYSTEMS  ARE 


OVER  A  MILLION  LINES  OF  SOURCE  CODE.  THAT  MAKES  IT  VIRTUALLY 


IMPOSSIBLE  FOR  ANY  ONE  OR  EVEN  A  SMALL  GROUP  OF  PEOPLE  TO 


UNDERSTAND  WHAT  THE  SYSTEM  IS  DOING.  OTHER  PROBLEMS  ARE  CAUSED 


BY  THE  PREVALENCE  OF  ASSEMBLY  LANGUAGE  IN  MILITARY  SYSTEMS. 


UNTIL  WE  CAN  GET  THE  BULK  OF  OUR  SOFTWARE  DONE  IN  A  MODERN  HIGH 


ORDER  LANGUAGE  THAT  PROVIDES  STRUCTURE  AND  EASE  OF  UNDERSTANDING, 


WE  WILL  CONTINUE  TO  HAVE  HUGE,  POORLY  DOCUMENTED  PROGRAMS  THAT 


ARE  FILLED  WITH  BUGS.  MILITARY  SOFTWARE  IS  STILL  DONE  IN  A 


HIGHLY  LABOR  INTENSIVE  WAY.  UNTIL  WE  CAN  BRING  A  DEGREE  OF 
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AUTOMATION  TO  THE  SOFTWARE  DESIGN,  CODING,  MAINTENANCE  AND 
MANAGEMENT  PROCESSES,  WE  WILL  CONTINUE  TO  LAG  BEHIND  IN  OUR 
EFFORTS  TO  PROVIDE  RELIABLE  SOFTWARE  TO  OUR  FORCES  IN  THE  FIELD. 
THE  PROBLEM  WITH  SOFTWARE  PEOPLE  WILL  ONLY  GET  WORSE.  THE  UNITED 
STATES  AS  A  WHOLE  IS  CURRENTLY  SHORT  50-100,000  SOFTWARE 
PROFESSIONALS.  BY  1990,  THE  SHORTAGE  WILL  GROW  TO  OVER  1 
MILLION.  WE  MUST  INCREASE  THE  PRODUCTIVITY  OF  THESE  SCARCE 
PROFESSIONALS.  WE  HAVE  ANOTHER  SERIOUS  PROBLEM  WITH  OUR 
SOFTWARE.  IT  COSTS  TOO  MUCH.  THIS  YEAR,  DOD  WILL  SPEND  ABOUT 
$10  BILLION  ON  TACTICAL  SOFTWARE  ALONE.  THIS  DOES  NOT  INCLUDE 
THE  SOFTWARE  FOR  OUR  BUSINESS  ADP  SYSTEMS.  WE  ESTIMATE  THAT  BY 
1990,  WE  WILL  SPEND  $30  BILLION  A  YEAR  ON  THE  SOFTWARE  FOR  THESE 
SYSTEMS.  THAT’S  OVER  10%  OF  THE  ENTIRE  DEFENSE  BUDGET,  AND  WE 
CAN’T  AFFORD  IT.  ANOTHER  CONTRIBUTOR  TO  THE  SOFTWARE  CRISIS  IS 
THE  OLD  TECHNOLOGY  WE  USE  IN  DEVELOPING  AND  MAINTAINING  MISSION 
SOFTWARE.  THIS  OLD  TECHNOLOGY  IS  NOT  LIMITED  TO  THE  HIGH  USE  OF 
ASSEMBLY  LANGUAGE,  BUT  SPANS  THE  WHOLE  SOFTWARE  LIFE  CYCLE. 

NOW  THAT  I’VE  GOTTEN  THE  GLOOM  AND  DOOM  OFF  MY  CHEST,  WHAT 
DO  WE  PLAN  TO  DO  ABOUT  IT?  OUR  RESPONSE  TO  THE  SOFTWARE  CRISIS 
IS  THE  DOD  SOFTWARE  INITIATIVE.  THE  INITIATIVE  IS  ORGANIZED  AT 
THE  DOD  LEVEL  BECAUSE  THE  PROBLEM  COVERS  ALL  OF  DOD.  WHEN  YOU 
CONSIDER  DOD  AS  A  WHOLE,  WE  ARE  THE  LARGEST  SINGLE  BUYER  OF 
SOFTWARE  IN  THE  WORLD,  HENCE  WE  CAN  INFLUENCE  THE  MARKET  AND  THE 
STATE  OF  PRACTICE  IF  WE  ACT  TOGETHER.  WE  ALSO  FEEL  THAT  AN 
INITIATIVE  MANAGED  AT  DOD  WILL  PROVIDE  THE  CRITICAL  MASS  OF 
PEOPLE  AND  FUNDING  THAT  THE  PROBLEM  DEMANDS.  THE  GOAL  OF  OUR 


INITIATIVE  IS  TO  DEVELOP  ADVANCED  METHODS,  TECHNIQUES  AND  TOOLS 
TO  REDUCE  COSTS,  SHORTEN  SCHEDULES,  AND  IMPROVE  THE  RELIABILITY 
AND  ADAPTABILITY  OF  FUTURE  MISSION-CRITICAL  SOFTWARE. 

IT  IS  CRITICAL  TO  THE  SUCCESS  OF  THE  DOD  SOFTWARE  INITIATIVE 
THAT  WE  INVOLVE  ALL  OF  THE  SERVICES  AND  AGENCIES.  WE  WILL  FAIL 
IF  WE  ALLOW  THE  INTERESTS  OF  ONE  COMPONENT  TO  PREDOMINATE  AT  THE 
EXPENSE  OF  THE  OTHERS.  WE  MUST  MOTIVATE  INDUSTRY  TO  INVEST  ITS 
SKILLS  AND  FUNDS  IN  THE  SOFTWARE  INITIATIVE  IF  WE  ARE  TO  HAVE  A 
SIGNIFICANT  EFFECT  ON  THE  PROBLEMS  FACING  US.  WE  MUST  ALSO 
COORDINATE  THE  SOFTWARE  INITIATIVE  WITH  OTHER  HIGH  LEVEL  NATIONAL 
INITIATIVES  SUCH  AS  THE  STRATEGIC  COMPUTING  PROGRAM.  AND  THE 
VHSIC  PROGRAM. 

THE  SOFTWARE  INITIATIVE  HAS  THREE  COMPONENTS.  THEY  ARE  THE 
ADA  COMMON  HIGH  ORDER  LANGUAGE  EFFORT,  THE  SOFTWARE  TECHNOLOGY 
FOR  ADAPTABLE,  RELIABLE  SYSTEMS  OR  STARS  PROGRAM  AND  THE  DOD 
SOFTWARE  ENGINEERING  INSTITUTE.  I'LL  NOW  GIVE  A  BRIEF  OVERVIEW 
OF  EACH. 

THE  ADA  PROGRAM  IS  DOD’S  EFFORT  TO  PROVIDE  A  SINGLE, 

POWERFUL  HIGH  ORDER  PROGRAMMING  LANGUAGE  FOR  ALL  MISSION-CRITICAL 
SYSTEMS.  THE  NEED  FOR  A  STANDARD  LANGUAGE  WAS  MOTIVATED  BY  THE 
PROLIFERATION  OF  OVER  400  LANGUAGES  AND  INCOMPATIBLE  DIALECTS  IN 
USE  IN  OUR  SYSTEMS.  THIS  PROLIFERATION  LED  A  POOR  SUPPORT  BASE 
FOR  FIELDED  SOFTWARE  DUE  TO  THE  EXPENSE  OF  SUPPORT  TOOLS 
ADDITIONALLY,  THE  HIGH  ORDER  LANGUAGES  IN  USE  BEFORE  ADA  WERE 
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GENERALLY  UNSATISFACTORY  FOR  REAL  TIME  SYSTEMS.  THE  RESULT  WAS 
HIGH  COSTS,  EXTENDED  SCHEDULES  (ONE  SYSTEM  HAD  A  48  MONTH 
SCHEDULE  EXTENSION  DUE  TO  SOFTWARE),  FAILURES  IN  FIELDED  SYSTEMS, 
AND  VERY  LIMITED  FLEXIBILITY.  THE  DEPARTMENT  THEN  UNDERTOOK  A 
JOINT  EFFORT  TO  DEVELOP  REQUIREMENTS  FOR  A  STANDARD  LANGUAGE, 
DETERMINED  THAT  A  NEW  LANGUAGE  WOULD  BE  NEEDED  AND  EMBARKED  ON 
LANGUAGE  DEVELOPMENT  EFFORT.  IN  AN  EXCEPTIONALLY  OPEN 
DEVELOPMENT  PROCESS,  FOUR  COMPETITIVE  DESIGNS  WERE  NARROWED  TO 
TWO  AND  THEN  TO  ONE  BETWEEN  1977  AND  1980.  IN  THE  PROCESS,  OVER 
7000  COMMENTS  AND  RECOMMENDATIONS  WERE  RECEIVED  FROM  SOFTWARE 
EXPERTS  IN  15  NATIONS.  THE  RESULTING  LANGUAGE  DESIGN  WAS  ADOPTED 
AS  MILITARY  STANDARD  1815  AND  AN  AMERICAN  NATIONAL  STANDARDS 
INSTITUTE  STANDARD  IN  JANUARY  1983.  ADA  HAS  SOME  INTERESTING 
SOFTWARE  ENGINEERING  FEATURES  THAT  MAKE  IT  ESPECIALLY  USEFUL  FOR 
LARGE  SCALE,  LONG  LIVED,  COMPLEX  SYSTEMS.  (REMEMBER  THE  AVERAGE 
LENGTH  OF  OVER  A  MILLION  LINES  OF  SOURCE  CODE  PER  SYSTEM. )  IT  IS 
ORIENTED  TOWARD  SPECIFICATION  AND  DESIGN,  AND  IS  THEREFORE  IN  USE 
IN  MANY  SYSTEMS  AS  A  PROGRAM  DESIGN  LANGUAGE.  IT  IS  STRONGLY 
MODULAR  IN  NATURE  AND  USES  STRUCTURES  CALLED  PACKAGES  AS  SOFTWARE 
BUILDING  BLOCKS  AND  TASKS  TO  SPECIFICALLY  PROMOTE  CONCURRENT 
EXECUTION.  IT  HAS  SPECIFIC  FEATURES  FOR  HANDLING  HARDWARE  OR 
SOFTWARE  FAILURES,  AND  THUS  CAN  BE  USED  TO  PERMIT  GRACEFUL 
DEGRADATION  IN  THE  PRESENCE  OF  COMPONENT  FAILURES,  RATHER  THAN 
TOTAL  SYSTEM  FAILURE.  IT  IS  A  STRONGLY  TYPED  LANGUAGE  AND  HAS 
EXTENSIVE  FEATURES  FOR  COMPILE  TIME  CHECKING.  ITS  MODULES  CAN  BE 
SEPARATELY  COMPILED,  FACILITATING  THE  SOFTWARE  MAINTENANCE 
PROCESS.  IT  ALSO  HAS  SPECIFIC  FEATURES  FOR  INTERACTING  WITH  THE 


ft 

ft 


m 


fS 


Eft 


TARGET  COMPUTER’S  HARDWARE,  THUS  MOVING  AWAY  FROM  THE  NEED  FOR 
ASSEMBLY  LANGUAGE. 

THE  UNDER  SECRETARY  OF  DEFENSE  FOR  RESEARCH  AND  ENGINEERING 
HAS  DIRECTED  THAT  ADA  WILL  BE  USED  AS  THE  PROGRAMMING  LANGUAGE 
FOR  ALL  MISSION-CRITICAL  SYSTEMS.  WE  CONTROL  THE  LANGUAGE  BY 
SEVERAL  MEANS.  FIRST  WE  HAVE  A  REGISTERED  TRADEMARK  FOR  THE 
NAME,  AND  WE  ONLY  ALLOW  USE  OF  THE  NAME  ADA  ON  SOFTWARE  SYSTEMS 
THAT  HAVE  PASSED  THE  ADA  VALIDATION  TESTS.  THESE  TESTS  CONSIST 
OF  OVER  2500  PROGRAMS,  AND  DETERMINE  THAT  A  COMPILER  HAS  FULLY 
IMPLEMENTED  THE  STANDARD  DEFINITION  OF  THE  LANGUAGE.  WE  WILL 
PERMIT  NO  SUBSETS  OR  SUPERSETS  OF  THE  LANGUAGE,  SINCE  EITHER 
WOULD  PUT  US  RIGHT  BACK  WHERE  WE  WERE  WITH  A  HOST  OF  INCOMPATIBLE 
DIALECTS. 

THE  ADA  PROGRAM  HAS  BEEN  A  TREMENDOUS  SUCCESS  AND  WE  ARE 
VERY  PROUD  OF  ITS  ACCOMPLISHMENTS.  IT  HAS  BEEN  SELECTED  FOR  USE 
IN  OVER  130  DEFENSE  SYSTEMS.  IT  HAS  BEEN  ADOPTED  FOR  USE  IN  THE 
DEFENSE  SYSTEMS  OF  THE  UNITED  KINGDOM,  CANADA,  THE  FEDERAL 
REPUBLIC  OF  GERMANY,  SWEDEN,  AND  FOR  COMMAND  AND  CONTROL  SYSTEMS 
IN  THE  NATO  ALLIANCE.  IT  WILL  BE  USED  IN  THE  NASA  SPACE  STATION 
FLIGHT  SOFTWARE,  IN  FAA  UPGRADES,  AND  IN  SIMILAR  EFFORTS  IN  THE 
CANADIAN  AVIATION  ADMINISTRATION.  IT  WILL  BE  USED  IN  BOEING’S  NEW 
AIRPLANE,  THE  7J7.  THIRTEEN  DIFFERENT  COMPANIES  HAVE  DEVELOPED 
29  VALIDATED  COMPILERS.  SOME  OF  THESE  COMPILERS  ARE  COMPETITIVE 


IN  TERMS  OF  SPEED  AND  CODE  EXPANSION  RATIOS  TO  COMPILERS  FOR 
MATURE  LANGUAGES.  THERE  ARE  FIVE  COMPILER  VALIDATION  FACILITIES, 
TWO  IN  THIS  COUNTRY  AND  THREE  IN  EUROPE. 


EXPERIMENTAL  RESULTS  FROM  USE  OF  THE  LANGUAGE  TO  RECODE 
EXISTING  PROGRAMS  HAVE  BEEN  VERY  ENCOURAGING.  THE  FLIGHT  CONTROL 
PROGRAMS  FOR  THE  F-15  EAGLE  AND  F-20  TIGERSHARK  FIGHTER  AIRCRAFT 
WERE  RECODED  IN  ADA,  WITH  DRAMATIC  REDUCTIONS  IN  SOURCE  CODE 
SIZE,  AND  TEST  MISSIONS  SUCCESSFULLY  FLOWN.  THE  UNITREP  PROGRAM 
FROM  THE  WORLD  WIDE  MILITARY  COMMAND  AND  CONTROL  SYSTEM  WAS 
RECODED  FROM  THE  ORIGINAL  COBOL  INTO  ADA.  THE  ORIGINAL  60000 
LINES  OF  COBOL  CODE  WERE  REDUCED  TOO  5000  LINES  OF  ADA.  THE 
STRATEGIC  AIR  COMMAND’S  MOBILE  INFORMATION  MANAGEMENT  SYSTEM  WAS 
RECODED  FROM  JOVIAL  INTO  ADA  WITH  A  REDUCTION  FROM  130,000  LINES 
TO  8100  LINES  OF  CODE.  AN  INTERESTING  OBSERVATION  IS  THAT  OVER 
HALF  OF  THE  COBOL  STATEMENTS  IN  THE  WWMCCS  PROGRAM  WERE  DATA 
CHECKING  STATEMENTS  THAT  ADA’S  STRONG  TYPING  ELIMINATES  WITH 
COMPILE  TIME  CHECKING. 


THE  SECOND  COMPONENT  OF  THE  DOD  SOFTWARE  INITIATIVE  IS 
THE  SOFTWARE  TECHNOLOGY  FOR  ADAPTABLE,  RELIABLE  SYSTEMS,  OR  STARS 
PROGRAM.  IT  SEEKS  TO  IMPROVE  DOD’S  ABILITY  TO  PROVIDE  RELIABLE, 
COST  EFFECTIVE  MISSION-CRITICAL  SOFTWARE  BY  AN  ORDER  OF 
MAGNITUDE.  THE  STARS  EFFORT  WILL  BRING  A  HIGH  DEGREE  OF 
AUTOMATION  TO  THE  SOFTWARE  DEVELOPMENT,  MANAGEMENT  AND 
MAINTENANCE  PROCESSES.  IT  WILL  PROMOTE  THE  REUSE  OF  PROVEN 
PIECES  OF  CODE,  IMPROVE  THE  PRODUCTIVITY  OF  THE  SOFTWARE 
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PROFESSIONAL  WORK  FORCE  BY  AUTOMATION  AND  IMPROVED  TRAINING.  IT 


WILL  IMPROVE  THE  WAY  WE  ACQUIRE  SOFTWARE.  IT  WILL  ALSO  PROMOTE 
RAPID  PROTOTYPING  OF  SOFTWARE  SYSTEMS  TO  BRING  THE  USER, 

DESIGNER  AND  PROGRAMMER  TOGETHER  EARLY  IN  THE  SOFTWARE  LIFE 
CYCLE. 

A  KEY  TO  AUTOMATING  THE  SOFTWARE  FROCESS  IS  THE  USE  OF 
SOFTWARE  ENGINEERING  ENVIRONMENTS,  CALLED  SOFTWARE  FACTORIES  BY 
SOME.  THESE  ENVIRONMENTS  ARE  INTEGRATED  COLLECTIONS  OF  SOFTWARE 
TOOLS  THAT  DO  MUCH  OF  THE  WORK  PREVIOUSLY  DONE  IN  A  MANUAL  AND 
TIME  CONSUMING  WAY  BY  THE  PROGRAMMER.  THE  ENVIRONMENTS  TO  BE 
USED  FOR  DEFENSE  SOFTWARE  MUST  FACILITATE  THE  PORTABILITY  OF 
SOFTWARE  BETWEEN  THE  ENVIRONMENT  USED  BY  THE  CONTRACTOR,  AND  THE 
ENVIRONMENTS  USED  IN  GOVERNMENT  POST  DEPLOYMENT  SUPPORT  CENTERS. 
WE  MUST  BE  ABLE  TO  TRANSPORT  MISSION  SOFTWARE,  SUPPORT  TOOLS,  AND 
PROGRAMMERS,  OR  WE  WILL  BE  UNABLE  TO  MEET  THE  REQUIREMENTS  OF  THE 
90’ S. 


ONE  WAY  TO  ENSURE  THE  TRANSPORTABILITY  OF  SOFTWARE  BETWEEN 
ENVIRONMENTS  IS  TO  INSIST  ON  THE  USE  OF  STANDARD  ENVIRONMENTS. 
THE  USE  OF  A  STANDARD  ENVIRONMENT  HAS  THE  ADVANTAGE  OF  PROVIDING 
THE  PORTABILITY  DESIRED,  BUT  TENDS  TO  STAGNATE  THE  TECHNOLOGY, 
LOCKING  YOU  IN  TO  THE  STANDARD.  IT  ALSO  CARRIES  RISKS  OF  HIGH 

COSTS,  DELAYS  AND  POOR  QUALITY.  IT  WILL  TEND  TO  DISCOURAGE 

INDUSTRY  FROM  INVESTING  IN  THE  TECHNOLOGY,  SINCE  THEY  FAR  PREFER 
TO  USE  THEIR  OWN  PRODUCTS.  THE  OTHER  EXTREME,  THE  LAISSEZ-FAIRE 
OR  NO  STANDARDS  APPROACH  HAS  THE  ADVANTAGE  OF  A  LARGE  NUMBER  OF 
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QUALITY  ENVIRONMENTS  DEVELOPED  BY  INDUSTRY,  BUT  PROVIDES  LIMITED 
TRANSPORTABILITY  BETWEEN  ENVIRONMENTS.  PERHAPS  THE  BEST  APPROACH 


IS  TO  STANDARDIZE  ON  THE  INTERFACES  OF  THE  TOOLS  TO  THE 
ENVIRONMENT,  AND  ON  ADA  AS  THE  BASELINE  FOR  THE  ENVIRONMENTS. 

THIS  PROVIDES  THE  TRANSPORTABILITY  (THROUGH  STANDARD  INTERFACES), 
CREATES  A  ROBUST  MARKETPLACE  FOR  COMPONENTS  OF  THE  ENVIRONMENTS, 
FACILITATES  HIGHER  QUALITY  PRODUCTS,  REDUCES  COST,  PROVIDES  FOR 
CONTINUAL  UPGRADE  OF  THE  ENVIRONMENT  AND  ITS  COMPONENTS  AND  GETS 
THE  PRODUCTS  TO  THE  FIELD  FASTER. 

THE  STARS  PROGRAM  HAS  HAD  SOME  SUCCESS  IN  THE  AREAS  OF 
REUSABILITY  AND  SOFTWARE  ACQUISITION.  IN  SOFTWARE  REUSABILITY, 

WE  SUPPORTED  A  PROGRAM  CALLED  COMMON  ADA  MISSILE  PACKAGES  OR 
CAMP.  THE  CAMP  PEOPLE  FOUND  THAT  THERE  ARE  CERTAIN  PRIMITIVE 
FUNCTIONS  PERFORMED  BY  ALL  MISSILES.  THEY  THEN  CODED  THESE 
PRIMITIVES  IN  ADA  USING  STANDARD  INTERFACE  DEFINITIONS  AND  STORED 
THEM  IN  A  REPOSITORY  OF  REUSABLE  PARTS.  IN  SOFTWARE  ACQUISITION 
WE  HAVE  PROTOTYPED  AND  DEMONSTRATED  A  SOFTWARE  ACQUISITION 
MANAGERS  WORKSTATION. 


THE  THIRD  COMPONENT  OF  THE  SOFTWARE  INITIATIVE  IS  THE 
SOFTWARE  ENGINEERING  INSTITUTE .  IT  TAKES,  ON  THE  AVERAGE  FROM 
15-20  YEARS  TO  GET  A  NEW  SOFTWARE  ENGINEERING  CONCEPT  FROM  THE 
LABORATORY  INTO  USE.  REASONS  FOR  THIS  LONG  PERIOD  OF  TRANSITION 
FOCUS  ON  RELUCTANCE  TO  USE  NEW  IDEAS,  AND  A  DIFFICULTY  IN  SCALING 
UP  THE  CONCEPT  FROM  USE  ON  SMALL  LABORATORY  PROJECTS  TO  USE  IN 
LARGE  SCALE  SYSTEMS.  THE  SOFTWARE  ENGINEERING  INSTITUTE  WAS 
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CREATED  TO  BRING  THE  ABLEST  PROFESSIONAL  MINDS  AND  THE  MOST 
EFFECTIVE  TECHNOLOGY  TO  BEAR  ON  THE  RAPID  IMPROVEMENT  OF  THEW 
QUALITY  OF  SOFTWARE.  IT  WILL  ACCELERATE  THE  TRANSITION  OF  NEW 
TECHNOLOGY  INTO  USE  IN  DEFENSE  SOFTWARE,  PROMOTE  THE  USE  OF 
MODERN  TECHNIQUES  AND  METHODS,  AND  ESTABLISH  STANDARDS  OF 
EXCELLENCE  FOR  SOFTWARE  ENGINEERING  PRACTICE.  IT  IS  A  NEW 
FEDERAL  CONTRACT  RESEARCH  CENTER,  ESTABLISHED  IN  DECEMBER  1985  AT 
CARNEGIE-MELLON  UNIVERSITY.  IN  ITS  FIRST  YEAR,  IT  HAS  STAFFED  UP 
TO  100  PROFESSIONAL  STAFF  MEMBERS,  AND  COMPLETED  SIGNIFICANT 
PROGRAM  PLANNING.  IT  HAS  EVALUATED  SEVERAL  EXISTING  ADA 
ENVIRONMENTS,  COMPLETED  A  RIGOROUS  STUDY  OF  RIGHTS  IN  DATA 
ISSUES,  DEVELOPED  A  MASTER  OF  SOFTWARE  ENGINEERING  CURRICULUM, 
ESTABLISHED  AN  AFFILIATES  PROGRAM  WITH  INDUSTRY  AND  HAS  STARTED 
ON  A  SHOWCASE  SOFTWARE  ENGINEERING  ENVIRONMENT. 

THE  DOD  SOFTWARE  INITIATIVE  HAS  A  VISION  OF  THE  FUTURE, 

WHICH  HAS  DOD  MEETING  ITS  MISSION-CRITICAL  REQUIREMENTS  WITH  THE 
NEEDED  QUALITY,  ON  TIME,  AT  REASONABLE  COST  AND  DOING  SO 
ROUTINELY  AND  PREDICTABILITY.  WE  SEE  MARKET  PLACES  EXISTING  IN 
SOFTWARE  TOOLS,  METHODS  AND  ENVIRONMENTS,  AND  IN  REUSABLE 
SOFTWARE  COMPONENTS.  DOD  WILL  BE  AN  INTELLIGENT  SOFTWARE  BUYER. 
INDUSTRY  AND  THE  DEPARTMENT  WILL  HAVE  THE  ABILITY  TO  RAPIDLY 
ESTABLISH  UP  TO  DATE,  POWERFUL,  INTEGRATED  SOFTWARE  ENGINEERING 
ENVIRONMENTS  AS  THEY  ARE  REQUIRED,  AND  THE  COST  OF 
MISSION-CRITICAL  SOFTWARE  WILL  BE  REDUCED  BY  A  FACTOR  OF  100. 
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WE  ALSO  FACE  EQUALLY  CHALLENGING  PROBLEMS  FOR  THE  FUTURE  IN 
COMPUTER  HARDWARE.  WE  HAVE  VALID  REQUIREMENTS  FOR  COMPUTERS  THAT 
CAN  RUN  AT  SPEEDS  OF  BILLIONS  OF  OPERATIONS  PER  SECOND,  BUT  FIT 
IN  THE  SAME  OR  SMALLER  SPACES  THAN  TODAY’S  SYSTEMS.  THERE  HAVE 
BEEN  TREMENDOUS  ADVANCES  IN  CIRCUITRY  RESULTING  IN  ORDERS  OF 
MAGNITUDE  SPEED  IN  COMPUTER  PERFORMANCE,  BUT  WE  ARE  APPROACHING 
THE  BARRIER  IMPOSED  BY  THE  SPEED  OF  LIGHT  IN  SIGNAL  PROPAGATION 
WITHOUT  ACHIEVING  THE  SPEEDS  WE’LL  NEED  IN  THE  1990 ’S.  THE 
ANSWER  APPEARS  TO  BE  PARALLEL  PROCESSING,  ONE  OF  THE  IMPORTANT 
ISSUES  ADDRESSED  BY  THIS  CONFERENCE.  HOWEVER,  THERE  ARE  SERIOUS, 
OPEN  ISSUES  FACING  US  BEFORE  WE  CAN  GET  PARALLEL  PROCESSING  INTO 


WIDE  USE. 


ONE  OF  THE  MOST  IMPORTANT  QUESTIONS  IS  HOW  DOES  ONE  PROGRAM 
A  PARALLEL  MACHINE  TO  GET  THE  MOST  OUT  OF  IT?  ARE  EXISTING 
LANGUAGES  ADEQUATE?  ADA  HAS  SPECIFIC  FEATURES  TO  SUPPORT 


CONCURRENT  EXECUTION  BY  THE  USE  OF  TASKS  AND  RENDEZVOUS 
STATEMENTS.  WILL  THESE  BE  ADEQUATE?  LISP  HAS  BEEN  EXTENDED  TO 
PROGRAM  THE  CONNECTION  MACHINE,  A  HIGHLY  PROMISING  MACHINE 
DEVELOPED  FOR  DARPA  THAT  FEATURES  64000  PROCESSORS  CONNECTED  BY  A 
DYNAMIC  COMMUNICATIONS  NETWORK.  OTHER  QUESTIONS  ARE  HOW  DO  YOU 
DETECT  THE  OPPORTUNITY  FOR  PARALLELISM  IN  AN  ALGORITHM  AND 
PARTITION  IT  CORRECTLY?  HOW  DO  YOU  CONNECT  THE  PROCESSING 
ELEMENTS  TO  PROVIDE  THE  NEEDED  COORDINATION  WITHOUT  TYING  DOWN 
THE  MACHINE  WITH  THE  COMMUNICATIONS  OVERHEAD?  WILL  WE  BE  ABLE  TO 
RETARGET  EXISTING  SOFTWARE  TO  PARALLEL  MACHINES? 
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BEFORE  WE  GET  TO  THE  TIME  WHEN  WE  HAVE  PARALLEL 
ARCHITECTURES  IN  PLACE  IN  OUR  SYSTEMS,  WE  MUST  BE  CONCERNED  WITH 
WHAT  WE  DO  WITH  THE  ARCHITECTURES  AVAILABLE  TO  US  TODAY.  WE  KNOW 
THAT  INSTRUCTION-SET  ARCHITECTURE  STANDARDIZATION  PROVIDES  FOR 
COMMONALITY  AND  EASE  OF  REPAIR  IN  THE  FIELD,  BUT  IT  IS 
POLITICALLY  UNPOPULAR  BECAUSE  IT  IS  PERCEIVED  AS  DISCOURAGING 
COMPETITION.  IS  THERE  A  WAY  TO  GET  THE  BENEFITS  OF 
INSTRUCTION-SET  STANDARDIZATION  WITHOUT  PAYING  THE  PENALTIES?  ON 
THE  TECHNICAL  SIDE,  WHAT  DO  WE  DO  ABOUT  OUR  ARCHITECTURES?  IS  THE 
REDUCED  INSTRUCTION-SET  COMPUTER  OR  RISC  THE  WAY  TO  GO?  I 
REMEMBER  RECEIVING  A  NASTY  SHOCK  WHEN  THE  ADA  COMPILER  WE  WERE 
FOOLING  AROUND  WITH  ON  THE  NEBULA  ARCHITECTURE  WE  USED  ON  THE 
MILITARY  COMPUTER  FAMILY  PROJECT  ABSOLUTELY  REFUSED  TO  GENERATE 
THE  BEAUTIFUL  INSTRUCTIONS  I  HAD  INSISTED  ON  INCLUDING,  BUT  CHOSE 
THE  PLAIN  OLD  SIMPLE  ONES  INSTEAD.  WE  HAVE  A  SESSION  ON 
INSTRUCTION  SETS  IN  THE  CONFERENCE,  AND  I  LOOK  FORWARD  TO 
PARTICIPATING. 


THIS  CONFERENCE  HAS  AN  OPPORTUNITY  TO  ADDRESS  THE  GREATEST 
CHALLENGES  WE  FACE  IN  PROVIDING  OUR  MILITARY  FORCES  THE 
CAPABILITY  TO  OVERCOME  AN  ENEMY  WHO  OUTNUMBERS  US  AT  LEAST  10  TO 
ONE  IN  ALL  THE  AREAS  THAT  MATTER.  I  CAN  TELL  YOU  AS  AN  ARMY 
OFFICER  THAT  WE  WILL  ONLY  PREVAIL  IF  WE  CAN  DO  MORE  THAN  TALK 
ABOUT  COMBAT  MULTIPLIERS.  OUR  NATIONAL  SECURITY  DEPENDS  ON 
PEOPLE  LIKE  YOU  WHO  CAN  BRING  OUR  GREAT  RESOURCES  IN  TECHNOLOGY 
TO  BEAR  ON  THE  PROBLEMS  WE  FACE. 
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In  previous  publications  (e.g.,  Colwell  et  ad.1),  we  have  attempted  to  present  a  relatively  impartial 
view  of  the  "Reduced  Instruction  Set  Computer  (RISC)/Complex  Instruction  Set  Computer  (CISC)" 
debate.  This  abstract  summarizes  a  few  additional  observations  about  the  RISC/CISC  controversy, 
with  the  objective  of  stimulating  some  thought  in  an  area  where  more  would  clearly  help. 

The  RISC  design  philosophy  includes  some  significant  and  potentially  valuable  challenges  to  many 
of  the  implicit  assumptions  which  have  guided  computer  design  for  many  years.  Examples  of  these 
assumptions  and  challenges  include: 

•  '!f  some  of  the  instructions  are  best  implemented  in  microcode,  then  all  of  them  are  or  at 
least  can  reasonably  be.'  Hard-core  RISC  people  argue  that  all  instructions  should  ex¬ 
ecute  in  a  single  cycle  and  thus  there  should  be  no  microcode  (unless  you  want  to  think 
of  the  machine  language  as  being  elementary  microcode  stored  in  main  memory).  A 
non-dogmatic  view  would  be  that  vertical  microcode  is  frequently  no  longer  cost-effective 
for  simple  instructions,  but  that  in  many  cases  multicycle  instructions  are  appropriate, 
perhaps  some  microcoded  and  some  hardwired,  and  that  complex  instructions  (whether 
microcoded  or  hardwired)  need  not  slow  down  the  execution  of  simple  ones. 

•  'The  dichotomy  between  instruction  set  architecture  and  its  implementation  (which  was 
one  of  the  important  contributions  of  the  IBM  System/360  family)  13  invariably  the  correct 
design  style.’  RISC  designs  deliberately  discard  this  distinction  —  for  example,  making 
implementation  mechanisms  such  as  caches  and  pipelines  explicitly  visible  and  con¬ 
trolled  at  the  instruction  set  architecture  level.  It  is  probably  clear  to  everyone  that  there 
are  certain  performance  advantages  to  he  gained  by  this,  but  it  is  not  obvious  that  these 
always  outweigh  the  disadvantages.  Iri  a  world  where  compatibility  of  system  and  ap¬ 
plication  software  is  a  mainstay,  and  a  family  of  compatible  machines  is  the  primary 
vehicle  for  accomodating  a  wide  range  of  cost/performance  requirements,  and  any  pur- 
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ticuiar  machine  is  expected  to  perform  well  on  a  wide  variety  of  commercial  and  scientific 
applications,  the  RISC  approach  of  creating  every  computer  architecture  and  implemen¬ 
tation  anew  has  problems.  It  is  worth  noting  that  the  DoD  effort  to  have  several  contrac¬ 
tors  competitively  develop  military  computers  based  on  the  MIPS  design  includes  the 
definition  of  a  new  level  of  standardization:  an  intermediate  language  as  the  target  for  al< 
compilation,  which  is  then  translated  into  machine  code  performing  low  level  resource 
management  (e  g.,  pipeline  scheduling,  register  allocation,  cache  control)  optimally  for 
each  different  MIPS  implementation. 


Such  self-examination  brought  on  in  the  computer  systems  design  community  by  the  RISC  school  of 
thought  is  refreshing  and  welcome.  However,  scientific  objectivity  and  rationality  have  been  all  but 
swept  away  in  the  deluge  of  public  relations  fanfare  generated  lately  by  marketing,  media,  and  even 
engineering  advocates  of  RISC’s  (for  various  electro- political  reasons).  The  CISC  camp  has  allowed 
this  to  be  a  rather  one-sided  debate  by  largely  maintaining  its  unfortunate  historic  low  visibility  of 
rationale,  and  by  having  a  less  coherent  position  not  readily  articulated  in  slogans.  Some  activists  in 
the  RISC  movement  apparantly  are  deliberately  attempting  to  polarize  the  situation,  but  we  believe 
that  there  is  a  silent  majority  of  computer  systems  designers  who  are  wisely  and  impartially  awaiting 
sufficient  credible  evidence  before  adopting  positions  appropriate  for  their  circumstances  inbetween 
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The  theologically  purist  RISC  researchers  have  put  forth  not  only  some  stimulating  departures  from 
orthodoxy,  but  also  some  misleading  assertions  about  both  their  approaches  and  their  results.  Many 
RISC  study  publications  have  left  much  to  be  desired  in  scientific  and  engineering  credibility,  perhaps 
constituting  a  local  minimum  in  the  area  of  computer  architecture;  even  if  all  their  assertions  were 
eventually  to  be  proven  totally  correct,  that  would  not  excuse  the  poor  quality  of  investigation  and 
reporting.  It  is  peculiar  that  high  quality  products  like  the  Encyclopedia  Britannica  and  Kirby  vacuum 
cleaners  are  marketed  using  questionable  tactics;  the  most  skeptical  observers  no  doubt  believe  that 
the  RISC  viewpoint  deserves  better  treatment  from  its  proponents  and  the  media. 

A  few  computer  corporations  are  employing  a  small  subset  of  carefully  selected  RISC  perspectives 
in  what,  for  a  variety  of  technical  and  nontechnical  reasons,  may  eventually  prove  to  be  technically 
and  economically  visible  products.  Then  both  the  researchers  and  the  companies  make  exagerated 
claims  that  the  work  of  each  validates  that  of  the  other. 


It  is  revealing  to  compare:  the  Stanford  MIPS  with  its  MIPS  Computer  Corp.  and  DoD  contractor 
(Honeywell,  RCA,  Rockwell)  counterparts:  the  IBM  801  with  the  RT  PC  and  ROMP  chip  inside,  the 
Berkeley  RISC  I  and  II  with  any  commercially  available  RISC  products  such  as  the  HP  Spectrum;  and 
I  ho  marketing  hype- "RISC"  systems  from  Bidge,  Pyramid,  Convex,  Sperry.  Harris,  and  others  with  any 
-f  the  three  re. nearing  RISC  research  machines. 


Some  of  the  central  >e.ncU>  of  the  RISC  dogma  are  being  reflected  in  few  if  any  products 
(unsurprisingly,  since  little  conclusive  evidence  for  their  efficacy  has  yet  appeared,  at  least  in  public). 
One  of  the  positive  contributions  of  RISC  work  is  renewed  attention  to  the  performance  of  the  most 
frequently  executed  instructions,  which  tend  to  be  simple  register-to- register  operations;  but  it  is  an 
unjustified  extremist  position  that  only  those  instructions  should  be  included.  It  is  not  intrinsically  or 
even  obviously  true  that  single-cycle  execution  or  frequency  of  use  are  necessarily  the  (much  less  the 
only)  proper  criteria  for  inclusion  in  an  instruction  set.  Frequency  of  use  alone  does  not  determine 
throughput  —  an  infrequently  used  but  very  lengthy  instruction  can  become  the  performance  bot¬ 
tleneck  ( cf  the  DEC  vax  MOVC).  Nor  does  frequency  of  use  alone  determine  response  time  —  it  may 
be  critical  that  an  infrequently  used  instruction  execute  quickly  when  needed  (e.g.,  in  the  event  of  a 
reactor  overtemperature  alarm).  Instruction  importance  should  determine  what  is  optimized  in  the 
instruction  set  architecture,  even  though  it  is  harder  to  measure  and  context-dependent  (not  to  men¬ 
tion  more  difficult  to  make  simplistic  generalizations  about).  Instructions  should  be  included  as 
justified  by  cost-effectiveness  tradeoffs  to  meet  current  system  requirements  —  for  example:  perfor¬ 
mance  (e.g.,  algorithmic  specialization,  not  only  for  the  intended  application2  but  also  for  the  system, 
such  as  object  invocation,  interprocess  communication,  real-time  scheduling);  fault  tolerance  (e.g., 
support  for  atomic  transactions,  replication);  software  compatibility. 

Most,  if  not  all,  of  the  new  commercial  computer  products  claimed  to  be  "RISC's"  incorporate  a 
(sometimes  considerable)  number  of  less  frequently  used  instructions  —  directly,  or  indirectly  (cf  the 
Fairchild  Clipper  "macroinstruction  unit"),  or  via  "extended  function  unit"  and  co  processor  inter¬ 
faces  (cf  the  HP  Spectrum  machines).  Such  interfaces  lead  in  the  direction  of  the  tail  wagging  the 
dog,  where  the  best  approach  is  for  there  to  be  a  RISC  co-processor  in  a  CISC. 

Some  of  the  most  important  design  features  in  the  alleged  RISC  products  have  little  or  nothing  to  do 
with  RISC's  and  obviously  apply  just  as  well  to,  and  have  already  been  used  in,  CISC's.  Examples  of 
these  from  the  RISC  literature  include:  register  organization  (e  g.,  overlapped  register  windows); 
careful  assignment  of  complexity  across  different  system  levels;  re-use  of  information;  use  of  storage 
hierarchies;  re-targetable  operating  systems  (e.g.,  Unix);  simple  instructions  execute  quickly;  operat¬ 
ing  system  involvement  in  pipeline  and  cache  management:  a’  priori  (i.e.,  compiler)  pipeline  schedul¬ 
ing.  However,  as  mentioned  above,  the  RISC  philosophy  has  indeed  revitalized  some  of  these  topics. 

The  RISC  school  of  thought  has  interesting  dependencies  on  semiconductor  technology.  For  ex¬ 
ample,  it  would  appear  to  be  ideal  for  the  current  low  density  available  in  GaAs;  but  RISC  single  cycle 

2 

ct  the  SDIO  Eastport  Group  f(eport  rccoinniondalion  for  higher  level.  algorithmically  socialized,  non  numeric  functions 


19 


instructions  (although  the  published,  and  expected,  averages  are  in  the  several  cycle  range  due  to 
branches,  cache  misses,  etc.)  place  excessive  demands  on  the  scarce  processor/memory 
bandwidth,  exacerbated  in  microprocessors  by  pin  limitations  at  that  interface.  Since  the  RISC 
operators  and  operands  are  so  trivial,  keeping  the  processor  busy  becomes  more  difficult  as  it  in¬ 
creases  in  speed.  The  number  of  devices  possible  and  sufficient  for  a  RISC  processor  chip  are  only 
adequate  to  implement  a  RAM  chip  which  is  so  small  that  many  of  them  are  required  in  a  cache  large 
enough  to  keep  up  with  the  processor.  While  a  system  may  be  able  to  afford  one  expensive  GaAs 
processor  chip,  it  may  not  be  able  to  afford  many  expensive  GaAs  cache  RAM  chips.  As  density  rises, 
more  cache  can  be  migrated  on-chip,  but  probably  never  enough  —  consider  what  would  be  needed 
to  support  a  500  MHz  RISC  processor. 

Many  of  the  fans  and  media  publicists  on  the  RISC  bandwagon  tend  to  take  a  sadly  binary  view  of 
this  controversy:  you  are  either  for  RISC's  or  against  them,  there's  no  room  for  thoughtful  skepticism 
and  a  rational  middle  ground.  This  attitude  was  expressed  by  overzealous  political  activists  in  the 
1970’s  as  "Either  you  are  part  of  the  solution  or  you  are  part  of  the  problem".  Religions,  especially 
the  most  fundamentalist  ones,  have  always  insisted  that  you  are  either  at  peace  or  at  war  with  their 
ueily.  Aii  three  oi  these  groups  erroneously  believe  that  this  author  is  un  iiie  wiOny  slue  of  their 
causes.  I  agree  with  H.L.  Mencken  on  the  following  (among  other  things): 

"For  every  complex  problem 

there  is  a  simple  solution... 


and  it  doesn't  work." 
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ABSTRACT:  We  describe  a  static  analysis  of  the  instructions  used  to  implement  ail  the  system  software  on  the 
Lilith  computer.  The  results  are  compared  with  a  similar  analysis  performed  on  the 'Mesa  instruction  set 
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1.  Introduction 

The  Lilith.mmp(multiple  modula-2  processors)  project  is  centered  around  extensions  to  the 
Lilith  computer[l,2),  designed  and  implemented  by  Professor  Niklaus  Winh[21  at  the  Swiss 
Federal  Institute  of  Technology-Zurich  (ETH).  Wirth  has  refined  the  Lilith  into  a  powerful 
single-user  workstation.  A  network  of  over  eighty  Liliths  is  now  in  use  at  the  ETH  for  research 
and  teaching  purposes. 

As  a  prerequisite  to  the  design  of  a  second  version  of  the  Lilith  architecture,  we  have 
undertaken  an  analysis  of  the  effectiveness  of  the  current  instruction  sen  This  paper  describes 
the  results  of  the  first  phase  of  that  effort,  which  is  a  static  analysis  of  the  current  Lilith  software 
environment  This  work  is  part  of  a  continuing  set  of  experiments[3-5]  with  problem-oriented 
languages  and  language-oriented  architectures.  The  data  serves  as  an  independent  corroboration 
of  the  experiments  conducted  at  Xerox  PARC  for  the  Mesa  instruction  set[6j.  In  addition, 
information  is  provided  to  guide  architects  of  future  high-level  langauge  machines,  hinally,  some 
cf  the  experiments  proposed  for  the  Lilith.mmp  project  are  discussed. 

The  code  analyzed  consisted  of  the  Medcs-2  operating  system,  the  Moduli  compiler, 
and  all  other  system  software,  which  included  text  editors,  document  processors,  window- 
packages,  the  I/O  library,  and  numerous  other  modules.  The  software  was  composed  from  180 
modules  with  2,236  procedures,  comprised  of  146,293  instructions. 

We  point  out  that  a  static  analysis  has  the  most  impact  on  reducing  the  size  of  a  program’s 
object  code,  while  dynamic  statistics  help  most  in  the  area  of  execution  speed.  For  example,  the 
BitBIockTransfer(BBLT)  instruction  that  manipulates  bit  maps  was  only  used  14  times;  however, 
one  has  only  to  use  the  Lilith's  windowing  package  to  appreciate  the  advantages  of  including 
BBLT  in  the  instruction  sen 

2.  The  Lilith  Architecture 

The  Lilith  was  designed[2]  to  support  the  Modula-2  programming  language;  in  fact,  all 
programming  on  the  Lilith  is  in  Modula-2.  As  a  result,  the  hardware  organization  is  optimized  tc 
support  frequently  occurring  operations  in  the  language.  The  Lilith  is  a  16-bit  architecture; 
physical  memory  can  be  of  an  arbitrary  size  but  each  process’  address  space  is  restricted  to  64 K 
words.  A  16-word  stack  is  used  for  expression  evaluation  and  argument  passing.  The  evaluation 
stack  is  not  checked  for  overflow  as  the  compiler  is  expected  to  generate  code  in  such  a  wav  that 
overflow  is  avoided.  Figure  1  illustrates  the  format  of  the  hardw'are  state  vector  as  well  as  the 
layout  of  modules,  module-global  data,  code,  and  the  activation  record(frame)  stack. 

The  address  of  the  currently  executing  Lilith  process  is  contained  in  the  P  register.  A 
process  is  usually  stored  as  a  sequential  image.  The  first  record  in  a  process'  image  is  used  as  a 
save  area  (when  the  process  is  not  executing)  for  the  base  register  portion  of  the  hardware  state 
vector.  The  "in  use"  registers  in  the  evaluation  stack  are  saved  on  the  top  of  the  activation 
record(procedure  frame)  stack  together  with  a  count  of  their  number.  The  activation  record  stack 
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follows  the  save  area  in  a  process’  image. 

When  a  process  is  executing,  the  L  register  contains  the  address  of  the  activation  record  for 
the  current  procedure,  the  S  register  points  to  the  stack  top  for  that  execution,  and  the  H  register 
determines  the  stack  limit  address.  The  L  register  is  also  used  as  a  base  register  for  accesses  to  the 
local  variables  of  a  procedure. 

A  procedure  variable  on  the  Lilith  is  represented  as  an  8-bit  module  number  and  an  8-bit 
procedure  number  within  the  selected  module.  On  a  procedure  invocation,  the  module  number  is 
used  as  an  index  into  the  module  frame  table,  which  is  a  vector  of  relocation  bases  for  modules. 
The  module  frame  table  is  typically  shared  among  all  processes.  The  address  in  the  module 
frame  table  entry  points  to  the  global  data  frame  of  the  selected  module  and  is  used  to  initialize 
the  G  register.  The  G  register  is  used  as  a  base  register  for  accesses  to  the  global  variables  of  a 
module. 

The  First  word  of  the  global  data  frame  contains  the  address  of  the  code  fram e( insure c :i o ns ) 
for  a  module.  In  order  to  maximize  the  use  of  the  data  memory(the  lower  64  K.  words),  the  value 
is  multiplied  by  two  to  yield  a  17-bit  address;  thus,  code  frames  can  be  relocated  anywhere  in  the 
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first  128K  words  of  memory.  The  resulting  address  is  then  loaded  into  the  F  register,  which 
serves  as  the  base  register  to  the  instructions  for  a  module. 

On  a  procedure  call,  the  8-bit  module  number  is  used  to  select  the  global  data  and  code 
frames;  next,  the  8 -bit  procedure  number  is  used  to  index  an  offset  vector  that  is  stored  at  the 
beginning  of  the  code  frame.  The  offset  corresponding  to  the  procedure  number  is  then  used  to 
initialize  the  program  counter  register(PC).  By  convention,  procedure  zero  holds  the  instructions 
for  the  initialization  code  of  a  module.  Also  by  convention,  procedure  zero  is  expected  to  set  the 
second  word  of  the  global  data  frame  to  TRUE  to  indicate  that  module  initialization  is  in 
progress. 

In  summary,  there  are  three  important  base  registers  on  the  Lilith— G,  L,  and  F.  G  locates 
a  module’s  global  variables,  L  locates  a  procedure’s  local  variables,  and  F  locates  a  module’s 
instructions. 

LI  The  Lilith  instruction  set 

The  Lilith  uses  the  16-word  evaluation  stack  to  compute  expressions.  The  data  types  that 
are  supported  include  BOOLEAN,  BITSET,  CARDINAL,  INTEGER,  CHAR,  and  REAL,  Figure 
2  illustrates  the  instruction  formats  and  addressing  modes  for  the  Lilith.  Only  50  instructions  out 
of  the  256  possible  opcodes  occupy  more  than  a  single  byte.  Of  the  206  "short"  instructions,  115 
are  of  the  second  format;  that  is,  a  4-bit  opcode  and  a  4-bit  selection  value.  The  format-two 
instructions  were  allocated  to  what  was  assumed  to  be  the  most  frequently  occurring  operations; 
e.g.  immediate(-1..15),  load/store  local(0..11),  load/store  global(2..15),  load/store  evaluation 
stack  indirect(0..15),  and  local  procedure  calis(1..15). 


If  the  instructions  were  reasonably  matched  to  Modula-2's  needs,  one  would  expect 
compact  program  representations.  This  expectation  certainly  seems  to  be  realized  given  the  small 
size  of  the  system’s  software.  Also,  Wirth[8]  retargeted  the  code  generator  of  the  Modula-2 
compiler  for  both  the  National  Semiconductor  32000  and  the  Motorola  68000. 

“The  compiler  program  is  14%  longer  for  NS  and  56%  longer  for  MC  than  for  Lilith.  If  consider  the 
code  generator  parts  oruy,  the  respective  figures  are  37%  and  154%.  But  most  disappointingly,  the  reward  for  ail 
these  efforts  and  expenses  appears  as  negatne:  for  the  same  programs,  the  compiler  code  is  about  50%  longer  for 
NS  and  130%  longer  for  MC  than  for  Lilith.  The  cumulative  effect  of  having  a  more  complicated  compiling 
algorithm  applied  to  a  less  effective  architecture  results  in  the  compiler  for  the  NS  being  L8  times  more 
voluminous  than  that  for  Lilith,  whereas  the  compiler  for  the  MC  is  3 3  times  as  long." 

3.  A  Comparison  With  The  Mesa  Architecture 

The  Lilith  architecture  was  inspired  by  the  Mesa  instruction  set  architecture  for  the 
XEROX  PARC  Alto,  which  was  later  refined  in  designing  the  Dorado,  and  is  currently  being 
refined  again  as  part  of  the  Dragon  project[7].  The  comparisons,  described  later  in  the  paper,  are 
with  the  analysis  by  Sweet  and  Sandman[6]  of  ail  the  software  running  under  Pilot,  the  Mesa 
operating  system.  They  converted  the  existing  object  code  into  a  canonical  form.  This  included 
breaking  the  code  into  straight  line  sequences,  and  undoing  most  peephole  optimizations.  The 
intent  was  to  remove  any  biases  introduced  by  the  compiler’s  code  generation  strategies.  Their 
sampling  resulted  in  2.5  million  bytes  of  normalized  instructions. 

The  instruction  sets  for  the  Lilith  and  Mesa  architectures  are  very  similar.  For  example, 
both  use  the  G,  L  and  F  registers  in  an  identical  fashion  and  both  maintain  a  module  frame  table. 
There  are  a  few  differences,  however.  Mesa  stores  constants  in  a  procedure’s  code  frame;  this  is 
not  possible  on  the  Lilith.  Another  difference  arises  in  references  to  symbols  imported  from 
another  module. 

To  refer  to  external  variables  or  procedures  on  the  Lilith,  an  instruciicn  mur.  include  the 
symbol’s  module  number  and  an  offset  The  address  is  then  generated  by  indexing  the  module 
fra  trip,  table.  In  the  Mesa  architecture,  each  module  has  a  link  area,  which  is  initialized  with  the 
module  number  and  offset  of  external  symbols.  As  a  result,  external  procedure  calls  require  one 
more  level  of  indirection  than  on  the  Lilith.  However,  an  external  reference  on  the  Lilith  is 
encoded  as  a  three-byte  instruction;  whereas  on  the  Alto,  it  would  normally  occupy  only  a  single 
byte. 

Figure  3  compares  the  opcode  distribution  on  the  Lilith  with  that  of  the  Mesa  architecture. 
The  categories  are  derived  from  the  normalized  data  presented  by  Swee:  and  Sancman[6].  Tie 
data  provides  a  good  illustration  of  how  code  generation  strategies  and  language  -usage  can  afreet 
opcode  statistics.  With  respect  to  the  Mesa  statistics,  we  omitted  JNE/JE(3.94%)  because  there 
was  no  direct  comparison  with  the  Lilith’s  JPFC/BC(4.52%).  There  were  also  some 
frequently-occurring  Lilith  inStrUCtions(CHKZ(checi  zero onpn  array  bound),  ENTRlecier  procedure), 
GBl(g«.  l  from  nested  procedure))  that  did  not  have  Mesa  counterparts.  Even  though  the  mated  is  net 
exact,  the  sets  of  instructions  are  quite  comparable.  Next,  we  discuss  some  of  the  causes 
attributing  to  the  differences. 

First,  the  Lilith  statistics  indicated  a  greater  frequency  ofloads  on  local  variables  than  fer 
Mesa.  A  contributing  factor  is  the  use  of  peephole  optimization  by  the  Mesa  compiler,  thus, 
"SLWi  T  JLWi"  sequences  are  converted  to  the  equivalent  of  “COPT  SLWi"  to  save  a  memcrv 
reference.  As  Figure  3  illustrates,  the  COPTfcopy  iop  of  sue*)  instruction  is  heavily  used  by  the  Mesa 
compiler.  The  Mesa  softw'are  also  made  much  greater  use  of  doubleword  variables  than  the  Lilith 
software(the  Modula-2  compiler  did  not  support  the  LONGINT  type).  Other  differences 
occurred  in  the  number  of  procedure  calls(more  on  the  Lilith)  and  in  the  frequency  of  use  of  the 
"Load  Local  Address"  instruction(more  by  Mesa).  Again,  code  generation  strategies  have  an 
effect.  The  Lilith  generates  an  address  cell  for  every  local  array  or  record.  As  a  result,  it  is  difneuit 
to  determine  whether  a  "Load  Local"  is  loading  an  address  or  a  value.  The  use  of  an  address  cell 
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Opcode  Distribution  for  The  Lilith  and  Alto 

Lilith  Lilith  %  of  All  %  of  All 


Mnemonic  Injunction  DeicrlpUon  Count  Injunction*  For  the  Alto 


LI 

Load  Immediate 

26534 

ie .  13 

16.58 

LLV 

Load  Local  Vord 

22354 

15.21 

12 .69 

SLV 

Store  local  variable 

6752 

4  .62 

6.57 

COPT 

Duplicate  top  of  stack 

ie23 

e .  70 

5.16 

LLD 

Load  Local  Doubleword 

412 

6.27 

5.65 

CX 

Call  external  procedure 

6318 

5.65 

4.52 

JP,  EXC 

Unconditional  Dump 

4852 

3.32 

4.18 

L/SSV8 

Dereference  pointer  on  T3S 

2551 

2.42 

3.42 

SLD 

Store  Local  Doubleword 

35  B 

8.24 

j.es 

LLA 

Load  Local  variable's  Addr 

1465 

e.95 

2  36 

UtDO 

CARDIOL  add 

3558 

2.52 

2.34 

RET 

Procedure  return 

2668 

1.57 

1.55 

IGV 

Load  module  global  variable 

5366 

£.41 

mu 

CL 

Call  local  procedure 

4584 

3  13 

ISkH 

DADD 

Double  add 

_ 

2 

e  ee 

n 

65 .78*1 

73.3  tr. 

Figure  3 

increases  the  usage  of  the  "short"  opcodes  while  requiring  more  frame  stack  space  at  runtime. 
Another  significant  difference  arises  in  the  use  of  global  variables  by  the  Lilith’s  programmers. 

3.1  Module  and  procedure  statistics 

Ficure  4  summarizes  the  module  and  Drocedure  statistics  that  were  collected.  Most  are 

* 

self  explanatory.  The  "minimum  path  length"  measurements  represent  our  attempt  to  quantify 
how  long  a  procedure  tends  to  execute,  when  it  is  first  invoked,  before  transferring  to  another 
procedure.  The  idea  was  to  form  an  estimate  of  how  much  time  was  available  to  overlap  frame 
initialization  and  execution.  For  example,  the  saving  of  the  return  address  need  not  occur  before 
a  procedure  begins  execution  as  long  is  it  has  completed  before  the  procedure  exits.  The  path 
length  is  measured  in  instructions  and  takes  into  account  all  possible  execution  paths  from  a 
procedure's  entry  point _ _____ 

Module  and  Procedure  Statistics 
For 

The  Lilith 


Sumf'.&Td 

DescripUon  _ Averse  DerlaUon 


Minimum  path  length  to  ext.  call 

19.91 

64. 18 

Minimum  path  to  any  proc  .  call 

16  31 

76.81 

Procedure  calls  per  module 

74.63 

ies.48 

Procedure  calls  per  procedure 

6.13 

35.66 

Local  calls  per  module 

25.96 

57.69 

Local  calls  per  procedure 

2.45 

16  40 

Procedures  per  module 

12  21 

13  63 

Imported  modjles  per  nodule 

5  25 

2 .53 

Instructions  per  medule 

611  44 

971  54 

Instructions  per  procedure 

65  45 

255.52 

Data  sloe  per  mocule  (words) 

35.72 

45  52 

K'orps  reserved  bv  Ek'TR /o-ccedure 

2  01 

35 

Figure  4 


The  data  sizes  listed  include  the  number  of  words  per  module  and  the  number  of  worcs 


allocated  on  the  stack  frame  on  procedure  entry.  The  stack  frame  word  count  includes  only  one  or 
two  cells  per  variable;  thus  for  arrays  and  records,  the  address  cell  was  counted  but  not  the  space 
for  the  data  structure.  The  count  does  include  parameters,  however.  On  the  Lilith,  arguments 
are  passed  by  pushing  their  address  or  value  on  the  evaluation  stack.  In  the  body  of  a  called 
procedure,  then,  the  arguments  are  copied  into  local  parameter  cells.  As  a  result,  both 
parameters  and  local  variables  are  addressed  by  using  a  positive  offset  from  the  L  register. 

4.  T  .fifth  mmp,  Directions  For  The  Future 

The  Lilith.mmp  project  involves  a  set  of  experiments  that  are  intended  to  guide  the  design 
of  a  tightly-coupled,  multiple-processor,  high-level  language  workstation.  In  this  paper,  we  are 
concerned  only  with  instruction  set  design.  The  goals  of  Lilith.mmp  are  as  follows: 

Goals  of  I  -fifth  mmp 

1.  Integrated  support  for  lightweight  processes.  4.  High-speed  context  switch. 

2.  Graphics  performance  2S  good  as  the  Lilith.  5.  Multiple  instruction  set  processors. 

3.  Construction  of  a  Balanced  Instruction  Set  Computer. 

Lilith.mmp  is  intended  for  applications  that  have  a  requirement  for  both  large  numbers  of 
processes  and  a  fast  context  switch  time.  As  a  result,  we  must  find  a  feasible  alternative  to  the 
register  window  schemes  that  are  currently  in  vogue.  The  reason  is  that  saving  hundreds  of 
window  registers  on  a  context  switch  is  slow.  A  lightweight  process  is  characterized  by  a  small 
context  block  and  is  typically  created  using  Modula-2’s  InitCoroutine  primitive.  A  heavyweight 
process  would  be  created  using  the  equivalent  of  the  UNIX  "fork"  or  "exec"  system  calls. 

.Also,  while  we  ascribe  to  the  belief  that  it  is  advantageous  to  -use  pipelining  in  an  attempt 
to  execute  an  instruction  nearly  even-  machine  cycle,  we  are  unwilling  to  jump  on  the  Reduces 
Instruction  bandwagon.  The  Lilith  is  justly  famous  for  its  code  density,  an  advantage  that  we  are 
loath  to  discard-  The  final  three  sections  of  the  paper  present  some  of  our  preliminary  ideas, 
which  must  still  be  tested,  with  regard  to  code  density,  balanced  instruction  processing,  and  fast 
procedure  calls. 

4.1  Code  density 

A  modem  processor  must  be  capable  of  supporting  not  only  a  variety  of  data 
types(CHAR,  CARDINAL,  REAL),  but  also  a  number  of  different  widths(16,  32.  64  bits)  for 
each  type.  Typically,  it  is  most  advantageous  to  orient  the  ALU  and  data  types  toward  the  most 
frequent  width,  say  32  bits,  and  then  to  implement  loads  and  stores  for  the  ether  widths  tc 
perform  either  extension  or  repetition,  as  appropriate.  That  is,  a  16-bit  load  would  sign-extend  to 
32  bits  and  a  64-bit  load  would  be  implemented  as  two  32-bit  loads. 

The  problem  with  respect  to  code  density  is  how  to  encode  all  of  the  different  times  and 
widths  efficiently.  Even  though  our  static  analysis  indicated  a  low  usage  for  the  READ  type,  it 
cannot  be  disregarded  when  allocating  opcode  space.  When  2  program  is  executing  routines  in 
the  MathLib  module,  it  should  benefit  from  the  same  code  density  as  would  be  found  in  a 
CARDINAL  son  procedure.  The  instruction  set  designer,  then,  is  faced  with  a  choice  between 
giving  both  types  equal  opcode  space  and  playing  favorites. 

Our  solution  to  this  dilemma  is  to  overload  opcodes;  that  is,  each  type  is  characterized  by  a 
certain  number  of  widths,  immediate  constants,  and  operators.  The  assignment  of  these  opc one 
values  is  static  and  is  shared  by  all  types.  In  our  first  experiments^],  the  type  associated  with  the 
opcode  set  was  selected  by  means  of  a  MODE  instruction.  This  instruction  can  be  envisioned  as 
setting  a  base  register  that  identifies  the  associated  microcode  or  that  selects  a  functional  unit, 
such  as  a  floating-point  coprocessor. 
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The  advantage  of  the  MODE  approach  is  that  it  allows  a  static  set  of  opcodes  to  be  reused 
for  arbitrarily  many  types.  As  a  result,  all  types  share  equally  in  the  code  density  advantages  of 
the  opcode  assignment.  Furthermore,  the  MODE  approach  avoids  retrofitting  the  insmuction  set 
over  a  machine’s  lifetime  as  new  types  are  a'dded. 

Chong[5],  in  an  extensive  analysis,  used  a  Modula(not  Modula-2)  compiler  to  generate  a 
modified  MCODE  that  included  a  MODE  instruction.  The  static  analysis  showed  that  8.5%  of  all 
instructions  were  MODE  settings,  while  the  dynamic  analysis  demonstrated  that  MODES  made 
up  13%  of  all  instructions  executed. 

42  Balanced  instruction  set  processing 

One  of  the  most  cogent  principles  of  the  RISC  design  philosophy  is  the  so-called  N  +  l 
principle;  that  is,  beware  of  adding  an  additional  instruction  if  its  implementation  slows  the 
execution  of  the  N  existing  instructions.  Adherence  to  the  N  +  l  principle  leads  one  to  the 
conclusion  that  Reduced  Instruction  Set  Computers  are  an  appropriate  design  goal  More 
complex  instructions  are  then  coded  as  subroutines.  This  results  in  a  space  penalty  as  well  as  a 
speed  penalty.  The  graphics  on  the  Lilith  provide  a  good  example  of  the  difference  between 
BBLT(Bit  BLock  Transfer)  in  microcode  and  as  a  subroutine. 

Subroutines  require  frame  space  and  concentrate  memory  references  in  accesses  to  local 
variables  and  parameters.  We  propose  a  middle  ground  between  a  subroutine  and  a  RISC 
instruction;  e.g.  a  subroutine  that  can  use  the  evaluation  stack  for  storage  of  its  local  variables. 
The  RISC  unit  would  then  be  considered  as  a  primitive  execution  engine,  or  E-Unit.  It  is 
conceivable  that  there  might  be  one  EU  per  type;  thus,  EUs  could  be  mixed  and  matched  for 
different  applications.  The  current  MODE  setting  would  ^prerrrvnp  nprrii-ui-r  wc  ■  to 
"applied"  to  the  evaluation  stack. 

A  Balanced  Instruction  Set  Computer  would  then  be  composed  from  RISC  execution  unit: 
and  firmware  subroutines  stored  in  ROM.  The  reason  that  we  propose  multiple  RISC  execution 
units  is  due  to  the  N  +  l  rule.  As  we  said,  there  could  be  a  RISC  EU  per  type.  Most 
manufacturers  are  currently  using  this  approach  for  floating  point,  for  example.  The  other 
advantage  of  multiple  EUs  is  the  ability'  to  have  them  operating  in  parallel.  It  would  be 
advantageous,  for  instance,  to  be  able  to  overlap  the  execution  of  a  graphics  operator  with  the 
execution  of  a  floating  point  instruction. 

The  use  of  multiple  EUs  also  leads  to  the  notion  of  on  demand  context  switching.  In  this 
strategy,  an  EU  is  switched  when  a  process  that  is  different  from  its  owner  request  its  'use.  At 
this  point,  the  EU  must  complete  its  current  instruction,  save  its  state  in  its  owner's  context  biock, 
and  then  give  control  to  the  new  process.  The  alternative  to  on  demand  context  switching  is  to 
wait  until  all  EUs  can  be  halted  before  making  a  context  switch.  With  this  strategy,  the  context 
switch  time  is  bounded  by  the  execution  time  of  the  longest  instruction  in  any  of  the  EUs. 

If  BBLT,  polynomial  evaluation,  bit  map  searching  and  all  the  other  low-static  count 
instructions  are  really  necessary  and  useful  (that  is,  they  have  a  high  dynaraic-count/execution 
time  product),  how  do  we  go  about  implementing  ROM-based  subroutines  that  can  allocate 
frame  space  on  the  evaluation  stack?  Basically,  we  need  an  instruction  that  accesses  a  stack 
register  using  an  index  that  is  relative  to  the  top  of  the  evaluation  stack.  The  index  has  to  be 
relative  because  a  firmware  subroutine  has  no  way  of  know-mg  which  registers  are  available  when 
it  starts  execution. 

Each  firmware  subroutine  allocates  its  local  variables  on  the  evaluation  stack  on  entry. 
Expression  evaluation  then  uses  the  top  of  the  register  stack  as  before.  The  only  difficulty  is  that 
each  subroutine’s  local  variables  may  be  referenced  with  a  Dumber  of  different  offsets  as  the 
stack  space  used  for  an  expression  can  vary  from  one  statement  to  another.  The  compiler  can 
easily  keep  track  of  this  bookkeeping  detail. 
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Being  able  to  refer  to  registers  in  the  evaluation  stack  also  has  advantages  in  other  areas. 
First,  it  is  no  longer  necessary  to  copy  the  arguments  to  local  parameter  cells  for  firmware 
subroutines;  they  are  already  in  the  correct  location.  As  another  example,  the  Modula-2  compiler 
implements  WITH  references  by  storing  the  generated  address  in  a  local  variable,  which  must  be 
reloaded  on  each  reference  to  the  WITH  record.  With  the  proposed  architecture,  the  address 
could  be  saved  on  the  evaluation  stack  and  then  be  "brought  up"  by  indexing.  The  index  option 
can  also  be  used  to  retrieve  common  subexpression  values  that  were  saved  on  the  stack. 

Rather  than  using  a  top  of  stack  indexing  scheme,  another  alternative  would  be  to  require 
that  the  evaluation  stack  be  empty,  except  for  arguments,  when  a  firmware  subroutine  was 
invoked;  then  the  standard  register  numbering  scheme  could  be  used.  However,  this  alternative 
does  not  support  nested  subroutine  calls.  Furthermore,  with  true  random  access,  all  registers 
must  be  saved  on  a  context  switch  because  there  is  no  way  of  knowing  which  ones  are  "live".  In 
the  first  design,  the  value  of  the  evaluation  stack  pointer  indicates  the  number  of  active  cells. 

43  Fast  procedure  calls 

According  to  our  data,  the  average  number  of  instructions  per  procedure  was  66  and  the 
average  number  of  procedure  calls  per  procedure  was  6.  We  suspect  that  the  dynamic  frequency 
of  procedure  calls  is  even  higher.  In  addition,  procedure  calls  are  relatively  expensive.  Thus,  it  is 
important  to  minimize  their  cost.  There  have  been  a  number  of  good  ideas  that  we  would  like  to 
explore.  However,  the  Lilith  architecture  does  suggest  at  least  one  approach  that  has  not  been 
tried  before. 

On  the  Lilith,  the  arguments  to  a  procedure  are  passed  on  the  evaluation  stack. 
Furthermore,  the  stack  space  used  by  a  procedure’s  frame  can  be  determined  at  compile  time. 
As  a  result,  all  procedure  calls  within  a  particular  procedure  generate  the  frame  descriptors  for 
the  called  procedures  at  exactly  the  same  address.  Since  a  frame  descriptor  is  four  words.  th_s 
space  could  remain  allocated  for  the  duration  of  a  procedure’s  execution.  Of  the  four  cells,  only 
the  return  address  word  would  need  to  be  written.  This  design  could  help  to  reduce  memory 
traffic,  an  important  consideration  in  a  multiprocessor. 

5.  Summary  and  Acknowledgements 

Obviously,  there  are  many  design  decisions  to  consider.  Our  plan  is  to  implement  a 
hardware  simulator  first  and  then  bring  up  the  Lilith’s  operating  system  cn  the  simulator.  Thus, 
the  effects  of  design  decisions  on  system  performance  can  be  measured  in  a  more  realistic  way 
than  by  testing  with  only  user-level  programs. 

The  Lilith.mmp  project  is  a  joint  research  effort  with  Modula  Corporation  and  is  funded 
by  the  Virginia  Center  for  Innovative  Technology(CIT-INF-027)  and  the  Office  of  Naval 
Research  and  is  supported  by  grants  of  equipment  and  software  from  Burroughs  Corporation  and 
Modula  Corporation.  Mark  Wallitsch,  now  at  Bell  Labs,  collected  the  Lilith  statistics.  Professor 
Wirth  and  his  group  at  ETH  have  also  provided  invaluable  assistance  and  inspiration. 
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HIGH-PERFORMANCE  CPU  IMPLEMENTATION  OF  THE  NEBULA  INSTRUCTION  SET 

ARCHITECTURE 

R.  S.  Cheng 

RCA  Missile  and  Surface  Radar  Division 
Moorestown,  NJ  08057 

INTRODUCTION 


Limitations  of  16-bit  architectures  such  as  the  MIL-STD-1750  ISA  initiated  the  development  of  Nebula,  a  32- 
bit  ISA  standard  to  provide  better  support  for  military  applications.  Since  it  was  proposed  as  a  military 
standard,  it  was  optimized  for  Ada  programming  language  support.  Nebula’s  efficient  instruction  repertoire, 
variable  length  data  types,  powerful  task  and  procedure  control,  and  extensive  error  checking  interrupt  and 
trapping  mechanisms  allow  efficient  implementation  of  a  multi-tasking  environment  with  a  high  degree  of 
control  and  data  security. 


A  high-performance  CPU  subsystem  works  in  conjunction  with  a  two-level  hierarchical  memory  subsystem 
and  operates  in  parallel  with  an  autonomous  I/O  subsystem  that  executes  concurrent  I  O  channel  programs  A 
microprogrammable  CPU  architecture  was  adopted  for  design  flexibility  and  ease  of  modification  High 
throughput  performance  was  realized  through  the  use  of  extensive  parallelism  and  special  hardware  through¬ 
out  the  architecture. 

Designed  for  implementation  using  1.25-um  CMOS/SOS  technology  and  a  target  3-MIP  execution  rate,  the 
architecture  was  extensively  verified  using  Register  Transfer  Level  (RTL)  simulation.  It  was  then  demon¬ 
strated  via  breadboard  implementations  in  which  the  design  was  implemented  using  MSI.  LSI  technology  to 
provide  a  cycle-by-cycle,  functional  equivalent  of  the  projected  VLSI  implementation. 


The  VLSI  compatibility  of  this  architecture  was  demonstrated  by  replacing  core  CPU  functions  in  the  bread 
board  with  VLiil  circuit  prototypes  implemented  with  5-um  CMOS  SOS  technology.  These  scsiem.- 
delivered  to  the  Army  for  evaluation  in  1983,  serving  as  functional  representations  of  the  final  VLSI  imple¬ 
mentation. 


NEBULA  REQUIREMENTS 


Nebula  is  an  efficient  32-bit  ISA,  incorporating  numerous  features  that  directly  support  Ada  implementation 
A  32-bit,  byte  addressable,  virtual  address  space  facilitates  multi-tasking  and  multi-processor  configurations 
Variable  operand  sizes  ( 1,2,4,  and  8  bytes)  and  data  types  allow  efficient  support  of  Ada  data  types  Indepen¬ 
dent  specification  of  opcode  and  operand  types  in  a  variable-length  instruction  format  allows  flexible  and 
efficient  operand  addressing,  and  minimizes  instruction  code  space. 


Efficient  arithmetic.  logic,  and  bit-manipulation  instructions  are  provided  for  data  processing,  along  with 
special  instructions  that  support  task  loading  and  execution,  exception  handling,  and  program  control  IEEE 
standard  floating  point  arithmetic  and  error-handling  are  also  supported. 


The  independent  context-stack  and  procedure-based  control  structures  allow  efficient  implementation  of  Ada- 
level  procedure  calls  and  parameter  passing.  Special  task-control  instructions,  the  context  stack  structure, 
and  the  memory  management  scheme  together  provide  efficient  support  for  Ada  tasking.  Numerous  interrupt, 
exception,  and  trap-handling  mechanisms  are  defined  in  the  ISA  to  specifically  support  the  exception  and 
trap-handling  mechanisms  defined  in  Ada. 


DESIGN  GOALS  AND  STRATEGY 


High  system  throughput,  design  flexibility,  high  testability  and  reliability,  and  compactness  1  low  power  -mall 
size  light  weight'  were  the  major  design  goals.  The  complex  requirements  of  the  Nebula  ISA  and  the  VLSI 
design  constraints  further  complicated  the  design  task 


These  numerous  design  goals  and  objectives  had  to  be  prioritized  and  quantified  so  that  sensible  trade*  41- 
could  be  made  In  addition  to  establishing  a  minimum  performance  requirement  level  for  each  design  ohie* 
tive.  a  higher-level  performance  goal  was  also  set  for  attainment.  The  major  design  strategies  employed  are 
discussed  below 
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Optimization  Based  on  Instruction  Mix 

The  design  was  optimized  based  on  the  Instruction  Mix,  which  identifies  the  relative  frequency  of  occurrence 
of  all  instructions,  addressing  modes,  data  types,  and  various  modes  of  operation.  For  example,  register  and 
literal  operands  are  used  most  frequently;  typically,  7  or  less  parameters  will  be  passed  in  procedure  calls: 
data  elements  are  typically  4  bytes  or  less  in  size,  and  so  on.  These  assumptions  were  used  to  optimize  the 
design  by  focusing  on  the  more  frequent  operations  and  needs. 

Cost-Efficient  Parallelism  for  Speed 

Extensive  parallelism  was  introduced  at  all  design  levels  to  achieve  the  3-MIP  throughput  goai.  Instruction 
prefetching  and  parallel  decoding  were  overlapped  with  execution.  Data  processing  and  address  computations 
were  paralleled  by  separate  ALU  facilities.  Two  banks  of  register  cache  were  used  to  expedite  context  stack 
switching.  All  hardware  operates  in  parallel  under  the  control  of  the  highly  horizontal  microprogram  struc¬ 
ture.  Cost-effective  hardware  features  were  used  throughout  the  CPU  to  increase  parallelism. 

Hardware  Redundancy  for  Reliability/Maintainability 

With  high  reliability  and  maintainability  as  a  key  design  objective,  extensive  redundant  hardware  and  Built- 
In-Test  features  were  used  to  support  efficient  fault  detection  and  isolation  in  the  CPU  design.  Chip-level 
master-slave  redundancy,  level-sensitive  scanning  techniques,  and  a  Built-In-Test  (BIT)  processor  were  em¬ 
ployed  to  provide  (1)  on-line  fault  detection  without  throughput  degradation,  and  (2)  off-line  fault  isolation 
down  to  a  replaceable  VLSI  component. 

Programmability  for  Design  Flexibility  and  Ease  of  Modification 

The  programmability  of  the  Nebula  ISA  design  was  emphasized  to  provide  the  flexibility  needed  to  accommo¬ 
date  future  enhancements.  A  inicroprogrammable  architecture  was  selected  primarily  for  this  reason  How¬ 
ever,  throughput  periormance  was  not  sacrificed  by  providing  a  highly  horizontal  control  structure  with 
minimum  decoding  overhead.  Although  a  simple,  fixed-cycle  synchronous  interface  was  used  with  the  high¬ 
speed  cache  to  boost  system  throughput,  design  provisions  allow  wait  states  to  be  inserted  when  slower 
memories  need  to  be  employed.  By  the  same  token,  variable  length  CPU  cycles  are  supported;  microprogram- 
mable  cycle  lengths  accomodate  the  different  critical  paths  encountered.  This  also  provides  design  flexiblity 
for  future  enhancement  and  technology  insertion. 

Careful  Consideration  for  VLSI  Implementation 

Functional  and  VLSI  partitioning  played  a  major  role  in  the  design  process,  enabling  a  given  architecture  to 
be  implemented  in  a  given  VLSI  circuit  and  packaging  technology.  The  objectives  were  to  minimize  the 
number  of  unique  chip  types  and  the  number  of  chips  in  the  system,  thereby  minimizing  development  and 
production  costs  while  providing  the  desired  system  performance.  In  general,  minimizing  the  required  transis¬ 
tor  and  pin  counts  when  implementing  a  particular  function  penalizes  performance.  Consequently,  careful 
tradeoff  decisions  were  required.  Tight  encoding  in  all  non-critical  paths  minimized  the  I  O  pins  required  at 
the  expense  of  more  encoding/decoding  hardware  and  time.  Redundant  hardware  was  sometimes  used  for 
"look-ahead”  processing  to  alleviate  critical  paths.  In  some  cases,  redundant  hardware  was  distributed  in 
multiple  chips  to  enhance  throughput  or  reduce  I/O  pin  requirements.  Bit-slicing  was  also  used  to  reduce  I  O 
pin  requirements  and  maximize  common  chip  types. 

To  allow  optimum  use  of  silicon  at  reasonable  development  time  and  cost,  a  mixture  of  custom  (handcrafted' 
design  and  standard  cell  techniques  were  employed  in  developing  the  chip  set.  Standard  cell  technology 
supported  by  a  well-defined  set  of  CAD  (Computer  Aided  Design)  tools  and  layout  techniques,  was  selected  for 
design  efficiency.  Handcrafted  design  was  employed  for  regular  structures  to  improve  speed  and  densitv 
performance.  RCA’s  1.25-um  CMOS/SOS  technology  was  chosen  for  performance-critical  areas,  while  bulk 
CMOS  technology  was  selected  for  other  areas  to  reduce  system  cost.  The  packaging  scheme  consisted  of  132- 
pin,  leadless  chip  carriers  on  ceramic  substrates,  mounted  on  glass  epoxy  boards 

Global  Optimization  to  Balance  Conflicting  Goals 

Performance  goals  and  requirements,  NRE  (Non-Recurring  Engineering)  development  cost,  and  system  life¬ 
cycle  cost  were  all  significant  concerns  that  affected  our  Nebula  design  decisions  Additional  hardware  gener¬ 
ally  enhances  throughput  performance;  however,  it  also  increases  svstem  complexity,  cost.  size,  weight,  and 


power,  and  reduces  reliability  and  testability.  Built-In-Test  circuitry  enhances  testability  and  reliability, 
however,  it  may  reduce  system  throughput  and  add  hardware  cost  and  system  complexity.  The  various  design 
goals  and  performance  concerns  need  to  be  examined  simultaneously  in  tradeoff  studies. 

NEBULA  PROCESSOR  SYSTEM  ARCHITECTURE 

The  complete  computer  system  can  be  divided  into  5  subsystems,  as  shown  in  Figure  1;  namely  the  CPU. 
memory,  I/O,  Built-In-Test  (BIT)/Maintenance  Interface,  and  support  subsystems.  Each  subsystem  is  brieflv 
described,  and  the  CPU  subsystem  is  discussed  in  greater  detail. 
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Figure  1.  General  Block  Diagram  of  AN'UYK-41  Super  Minicomputer 


The  CPU  subsystem  is  a  microprogrammable  architecture  designed  to  execute  the  Nebula  ISA  at  a  projected  i 
MIP  execution  rate  It  contains  a  highly  pipelined  control  path  for  instruction  prefetching,  decoding,  and 
microprogram  sequencing  functions;  8K  x  128-bit  microprogram  control  store  for  ISA  execution  and  diagnos¬ 
tics;  a  32-bit  ALU  data  path  augmented  with  special  function  units  for  high-speed  data  processing;  and  special 
interface  logic  to  the  other  subsystems. 

The  memory  subsystem  was  implemented  as  a  2-level  hierarchy:  a  primary  storage  composed  of  a  high-speed 
4K-word  set  associative  cache,  and  4  megabytes  of  bulk  memory  as  secondary  storage  Virtual  cache  was 
selected  to  eliminate  the  address  translation  from  the  critical  path  in  order  to  support  fast  CPU  access  The  4 
megabytes  of  bulk  memory  were  implemented  as  2  pairs  of  1-megabvte  modules  interleaved  to  support 
efficient  cache  update.  With  this  configuration,  a  cache  hit  will  return  a  32-bit  word  in  40  ns  from  cache  On  a 
cache  miss,  four  32-bit  words  are  read  from  bulk  memory  on  two  consecutive  cycles  to  update  the  cache,  and 
the  required  word  is  available  to  CPU  in  240  ns. 

The  I/O  subsystem  contains  independent  I/O  Processors  <'.OPl  that  execute  the  Nebula  instruction  set  This 
subsystem  supports  high-speed  data  transfers  between  bulk  memory,  and  up  to  seven  independent  I  O  inter¬ 
faces  concurrent  to  CPU  operation.  In  addition,  the  10  subsystem  supports  data  transfers  between  the  bulk 
memory  and  the  BIT/Maintenance  subsystem  to  support  the  test  and  maintenance  function. 

The  BIT/Maintenance  Interface  subsystem  provides  the  central  control  for  fault  isolation  and  maintenance 
support.  It  (1)  accepts  errors  detected  by  the  various  BIT  circuits  distributed  throughout  the  CPU.  > 2 1  main¬ 
tains  the  fault  log,  and  1 3 >  interrupts  the  CPU  to  invoke  appropriate  diagnostic  operations  CPU  interna! 
states  can  be  loaded  and  accessed  by  a  built-in  scanning  path  controlled  by  the  BIT  processor,  and  results  can 
be  analyzed  for  fault  isolation. 


Finally,  the  support  subsystem  is  composed  of  the  power  supply  and  the  necessary  mechanical  and  thermal 
support  for  the  complete  minicomputer  system 


NEBULA  PROCESSOR  CPU  ARCHITECTURE 


A  highly  horizontal,  microprogrammable  CPU  with  a  microword  width  of  128  bits  was  designed  to  control 
autonomous  hardware  units  in  parallel  for  high  system  throughput.  Extensive  pipelining  allows  macro-level 
instruction  fetch  and  decode,  as  well  as  microprogram  sequencing  functions,  to  be  overlapped  with  instruction 
execution.  The  parallelism  involved  in  a  typical  instruction  execution  is  shown  in  Fig.  2.  Numerous  special 
hardware  features  efficiently  support  Nebula-specific  operations  to  enhance  system  performance.  A  hierarchi¬ 
cal  Built-In-Test  approach  provides  concurrent  fault  detection  with  no  system  performance  degradation  on 
non-error  conditions,  and  off-line  fault  isolation  down  to  a  specific  VLSI  component  after  errors  are  detected 
The  CPU  operates  at  a  25-MHz  clock  rate,  allowing  a  typical  register-based  instruction  to  be  executed  in  a 
single  40-ns  microcycle. 
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Figure  2.  Execution  of  Typical  Nebula  Instruction  Stream 


The  complete  CPU  subsystem  is  functionally  and  logically  divided  into  4  sections,  implemented  on  four 
ceramic  substrates  as  shown  in  Fig.  3.  These  include  the  Control  Pipeline  Substrate,  the  Arithmetic  Substrate, 
the  Address  Substrate,  and  the  Special  Function  Substrate.  With  a  total  hardware  redundancy  at  the  chip 
level,  the  CPU  subsystem  can  be  implemented  with  a  total  of  24  VLSI  chips  (9  unique  chip  tvpesi  and  HK  x 
128-bit  ROM. 

Control  Pipeline  Substrate  (IPU,  IDU,  MSU) 

Three  VLSI  chip  types,  the  Instruction  Prefetch  Unit  (IPU),  the  Instruction  Decode  Unit  iIDL\  and  the 
Microprogram  Sequencer  Unit  (MSU)  are  used  to  implement  the  CPU  control  function.  A  total  of  6  VLSI 
components.  2  of  each  type,  provide  a  fully  redundant  control  unit,  performing  Nebula  instruction  prefetching, 
decoding,  and  microprogram  sequencing  functions.  These  chips  range  from  20,000  to  35,000  transistors  in  size 

One  major  difficulty  posed  by  the  Nebula  ISA  is  the  independence  between  opcode  and  operand  specification  in 
a  variable- length  instruction  format.  An  instruction  typically  ranges  from  I  byte  to  over  30  bytes  long  and.  in 
the  special  case  of  a  Procedure  call  instruction,  it  may  be  well  over  1000  bytes  long,  supporting  up  to  250 
parameters.  A  16-byte  instruction  queue  controlled  by  a  parallel  hardware  decoding  scheme  allows  effective 
overlap  of  instruction  fetching  and  decoding  with  instruction  execution.  The  selected  size  of  the  queue  provides 
the  storage  needed  to  support  all  cases  of  the  prefetching  function,  yet  minimizes  the  queue  refill  overhead 
during  branch  conditions. 

Whenever  the  queue  can  accept  4  or  more  bytes,  and  the  data  bus  is  available,  four  instruction  bytes  will  be 
fetched.  Instruction  data  are  continually  fetched  into  the  queue,  and  subsequently  pipelined  into  the  decoding 
path  Two  hardware  pointers  generated  by  the  decoding  logic  extract  variable  numbers  of  bvtes  from  the 
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Figure  3.  CPU  Block  Diagram 


instruction  queue  into  two  levels  of  pipeline  registers.  One  pointer  points  to  the  next  opcode  or  operand 
specifier  to  be  decoded,  while  the  other  pointer  identifies  the  starting  byte  of  the  in-line  address  or  data  as  a 
result  of  the  last  decoding  phase. 


The  in-line  data  is  formatted  and  then  stored  into  dedicated  pipeline  registers.  Addresses  are  sign  extended 
and  data  literals  are  duplicated  to  4  bytes.  Simultaneously,  register  numbers  and  short  literals  are  encoded  to 
facilitate  data  manipulation  and  address  calculation  in  subsequent  operand  access  and  execution  phases. 
Error  checking  is  also  performed  during  the  decoding  process.  Any  exceptions,  including  memory  protection 
violations,  are  pipelined  so  that  the  exceptions  are  raised  when  execution  is  attempted. 


The  opcode/operand  specifiers  are  sent  to  the  decoding  path  and  decoded  along  with  the  current  opcode  and 
machine  status  to  generate  control  for  the  instruction  queue  and  the  starting  address  of  the  microprogram 
routines.  Since  the  register  and  in-line  short  literal  modes  are  used  most  frequently,  based  on  the  instruction 
mix,  their  decoding  and  set  up  are  completely  performed  by  the  decoding  hardware  and  overlapped  with 
current  CPU  execution.  Register  numbers  or  short  literals  are  encoded  directly  by  hardware  into  operand- 
select  registers,  which  are  used  to  access  the  data  at  execution  time  In  addition,  two  consecutive  bytes  are 
always  decoded  simultaneously,  allowing  as  many  as  two  operands  to  be  set  up  in  parallel 


Similarly,  for  memory  operands  where  aidress  calculation  and  data  fetch  are  required,  the  hardware  lotric  wil 
parse  the  operand  specifiers  and  encode  required  information  before  invoking  a  common  set  of  operand  acce-- 
micro-routines  for  the  operand  fetch.  After  a  maximum  of  3  operand-access  routines,  control  is  transferred  n 


ism*. 


an  opcode-specific  execution  routine,  where  operand  data  are  indirectly  addressable  by  the  encoded  informa¬ 
tion  in  operand-select  registers.  With  this  approach,  all  complex  addressing  modes  can  be  supported,  while  the 
simple  but  frequent  register-to-register  instruction  can  still  be  executed  in  a  single  40-ns  cycle.  Data  size 
information  is  also  encoded  in  these  operand-select  registers,  allowing  variable  data  operands  to  be  processed 
without  additional  overhead  and  delay. 

A  16-level,  Last-In-First-Out  (LIFO)  stack  is  included  to  support  micro-level  interrupts  and  subroutine  calls, 
and  a  14-bit- wide  microprogram  address  space  supports  the  addressing  of  up  to  16K  words  of  microprogram 
store.  A  16-bit  counter  and  various  support  logic  support  multi-way  branching,  micro-level  looping,  and 
flexible  conditional  case-branches  based  on  externally  supplied  condition  codes.  For  example,  when  a  short 
parameter  mode  is  encountered,  a  3-bit  parameter  number  and  an  illegal  parameter  indicator  can  be  fed  from 
the  decoding  logic  to  allow  a  multi-way  branch  into  unique  parameter  access  or  exception  routines. 

Arithmetic  Substrate  (RALU) 


One  VLSI  chip  type,  the  Register  Arithmetic  Logic  Unit  (RALU),  is  used  to  implement  the  required  data 
processing  function.  RALU  is  configured  as  a  16-bit,  bit-slice  VLSI  device;  two  RALU  devices  provide  the 
desired  32-bit  internal  data  path.  This  substrate  is  responsible  for  all  the  arithmetic  and  logic  operation-- 
required  to  support  the  instruction  set.  The  RALU  is  implemented  as  a  132-pin  package  containing  approxi¬ 
mately  45,000  transistors. 

To  support  the  frequently  used  register-addressing  mode  and  a  typical  3-operand  instruction  format,  a  36- 
word  by  16-bit  customized  3-port  register  stack  was  used  in  the  design.  Three  operand-select  registers,  directly 
encoded  by  the  instruction  decoding  logic  (one  dedicated  for  each  operand),  provide  indirect  access  to  the 
register  stack.  A  fourth  operand-select  register,  loaded  via  a  micro  literal,  allows  efficient  indirect  access  to 
the  stack  under  microprogram  control.  The  register  stack  is  also  directly  addressable  by  microcode  in  a 
conventional  way,  without  using  any  operand-select  registers.  To  expedite  switching  between  the  task  and 
kernel  context  stacks,  two  register  stacks  were  used.  This  allows  supervisory  or  interrupt  handling  functions 
to  be  invoked  readily  on  the  kernel  context  stack,  without  having  to  save  and  restore  the  registers  of  the 
interrupted  task. 

A  32-bit  ALU  provides  the  normal  arithmetic  and  logic  operations.  Special  hardware  logic  is  provided  to 
support  various  Nebula-specific  operations.  These  include  sign  extension:  byte  and  bit-manipulation,  trunca¬ 
tion  and  rounding  operations  as  controlled  by  a  combination  of  external  condition  codes;  various  machine 
states;  and  microprogram  control. 

Address  Substrate  (MAU) 

One  VLSI  chip  type,  the  Memory  Address  Unit  (MAU),  supports  all  the  auxiliary  functions  such  as  data 
operand  address  computation,  parameter  decoding/encoding,  program  counter  update,  etc.,  in  parallel  with 
data  processing  on  the  arithmetic  substrate.  The  MAU  is  structuallv  very  similar  to  the  RALU.  and  wa- 
implemented  as  a  16-bit,  bit-slice  device;  two  MAUs  are  used  for  implementing  a  32-bit  address  computation 
unit.  It  contains  various  special  logic  to  support  the  auxiliary  functions,  and  operates  autonomously  in  parallel 
with  the  arithmetic  substrate.  The  MAU  can  be  implemented  as  a  132-pin  package  with  approximately  45.000 
transistors. 

A  32-word  by  16-bit  register  stack  caches  context-stack  information  for  fast  access.  The  information  includes 
up  to  seven  parameter  descriptors  and  other  information  associated  with  the  procedure  context  Similarly,  a 
dual-register  cache  approach  is  employed  to  expedite  kemeUtask  context-stack  switching.  Three  dedicated 
address  registers  specify  the  memory  address  for  up  to  3  operands  in  an  instruction,  and  such  registers  are 
directly  accessible  under  microprogram  control. 

Special  logic  is  included  to  encode  and  decode  parameter  descriptors  to  expedite  procedure  calling  '  encoding 
and  parameter  access  (decoding).  The  update  of  the  Program  Counter  ( PC  >.  which  points  to  the  operand  hvte~ 
fetched,  is  performed  in  the  MAU  based  on  the  control  passed  from  the  instruction  decoding  logic.  In  addition 
to  the  normal  update  of  the  PC.  the  PC  is  also  pipelined  into  a  backup  register  at  the  Nebula  instruction 
boundary  This  PC  backup  register  and  special  logic  support  program  tracing  and  exception  and  trap  hun 
dling.  as  well  as  break  points. 
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Special  Function  Substrate  (ICU,  FPU,  MULT) 


The  Interrupt  Control  Unit  (ICU),  Floating  Point  Unit  (FPU),  and  a  Multiplier  (MULT)  provide  the  other  CPU 
auxiliary  functions  on  the  special-function  substrate.  The  ICU  contains  about  45,000  transistors,  while  FPU 
and  MULT  are  projected  to  have  35,000  transistors  each. 

The  ICU  provides  all  the  necessary  hardware  required  to  support  the  interrupt  and  trap  mechanism.  These 
include  counters  to  support  the  four  interrupt  timers  and  time-of-day  clock;  logic  to  support  the  maskable 
software  interrupt  requests;  and  a  bus  interface  to  accept  I/O  interrupts  from  the  I/O  subsystem.  To  enhance 
the  interrupt  response,  a  special  I/O-interrupt  bus  and  protocol  were  devised  to  allow  pending  interrupts  to  be 
polled  and  prioritized  in  parallel  with  CPU  execution.  The  results  are  then  sampled  at  macro-instruction 
boundaries,  where  all  interrupts  are  examined  and  acknowledged. 

In  addition  to  the  three  preceding  classes  of  interrupts,  various  run-time  exceptions  and  traps  are  also  handled 
simultaneouly  in  the  ICU,  based  on  a  pre-defined  priority  scheme  as  specified  in  the  ISA.  The  Processor  Status 
Word  (PSW),  which  resides  in  the  context-stack  memory,  is  cached  in  the  ICU  to  facilitate  interrupt,  excep¬ 
tion,  and  trap  checking  and  handling.  Similarly,  two  PSWs  are  cached  for  the  Kernel  and  Task  context  stacks 

The  FPU  and  MULT  were  built  as  hardware-assist  units,  enhancing  the  RALU  performance  on  multiplication 
and  floating  point  instructions.  The  FPU  supports  the  IEEE  standard  floating  point  format  as  required  by 
Nebula.  It  was  designed  to  operate  in  conjunction  with  the  RALU  to  support  the  execution  of  floating-point 
instructions.  The  FPU  contains  barrel  shifters,  exponent  adders,  normalization  logic,  and  the  extensive  error¬ 
checking  and  exception-handling  capability  required  to  support  the  floating  point  operations.  The  MULT  was 
designed  to  accept  a  pair  of  32-bit  operands  from  the  RALU  and  return  a  64-bit  product,  and  this  provides  a  7 : 1 
performance  enhancement  over  the  iterative  approach  supported  by  the  basic  RALU  capability 

Bus  Interface 

The  various  sections  communicate  over  two  32-bit,  bi-directional  data  buses,  the  Local  Data  Bus  1LBUS1  and 
the  CPU  Data  Bus  (DBUS).  LBUS  supports  inter-chip  communication  within  the  CPU,  and  part  of  the  bus  can 
be  sourced  by  microprogram  ROM  to  provide  a  micro  literal  for  masking  and  case  branching  operations  DBUS 
supports  communication  between  the  CPU  and  the  memory  subsystem,  and  also  provides  additional 
bandwidth  for  local  inter-chip  data  transfer. 

In-line  instruction  data  are  sent  from  IPU  to  the  various  processing  units  over  the  LBUS,  where  they  are 
properly  encoded  into  pipeline  registers.  Part  of  the  LBUS  can  be  multiple-sourced  with  hardware  condition 
codes  and  status  by  different  hardware  units  in  the  system.  This  part  of  the  LBUS  and  a  micro-literal  are  used 
to  form  specific  starting  addresses  in  MSU  that  allow  multi-way  branches  based  on  hardware  status  condi¬ 
tions.  To  preserve  microprogram  space,  a  simple  alignment  and  adder  logic  are  included  in  the  MSU.  to  control 
the  contiguous  address  space  of  such  micro-subroutines.  Special  bits  in  the  ICU-based  PSW  are  also  routed 
over  LBUS  to  different  parts  of  the  system  as  they  are  required  for  various  processing  and  error-checking 
operations. 

Memory  access  is  accomplished  over  the  DBUS,  where  memory  reads  are  supported  by  a  4K-word  cache,  and  a 
memory  write  is  queued  and  subsequently  written  to  both  cache  and  bulk  memory.  The  memory  is  byte 
addressable  as  defined  by  the  ISA,  while  a  32-bit  word-aligned  memory  is  physically  implemented.  One.  two. 
four,  or  eight  bytes  of  data  may  be  fetched  over  I  to  3  memory  cycles,  depending  on  the  address  of  the  starting 
byte.  Special  logic  is  included  in  the  RALU  to  efficiently  align  all  memory  data  during  read  and  write 
operations.  This  logic,  used  in  conjuction  with  the  conditional  branching  capability  in  MSU,  guarantees  that 
all  data  is  accessed  in  a  minimum  number  of  cycles.  A  memory  protection  check  is  performed  by  a  specific 
memory  management  unit  in  the  memory  subsystem,  while  exceptions  detected  will  be  received  by  the  CPI  as 
a  condition  code  to  initiate  a  memory  trap. 

Built-In-Test  Support  (EDC) 

Hardware  redundancy  at  the  chip  level,  level-sensitive  scanning  of  internal  machine  states,  and  a  BIT 
microprocessor  provide  on-line  concurrent  fault  detection  and  off-line  fault  isolation  down  to  a  single  chip  for 
replacement.  One  Error  Detection  and  Correction  <EDC)  chip  is  employed  on  every  substrate  to  collect  and 
report  errors  to  the  BIT  subsystem,  and  to  interface  the  CPU  with  the  BIT.  allowing  CPU  diagnostics  to  run 
under  the  control  of  the  BIT  processor. 


Every  VLSI  chip  in  the  system  is  configured  as  a  master-slave  redundant  pair;  both  members  receive  the  same 
system  inputs  and  produce  the  same  outputs.  Only  the  outputs  produced  by  the  master  chip  are  used  to  drive 
the  system.  These  outputs  are  also  sent  to  the  slave  counterpart,  where  all  the  redundant  outputs  are  com¬ 
pared.  A  single  chip-fault  signal  is  generated  by  combining  all  possible  faults.  All  chip  faults  are  collected  by 
the  EDC  chip  resident  on  that  substrate  and  communicated  to  the  BIT  processor.  Having  received  the  error 
reports,  the  BIT  processor  can  gain  control  of  the  complete  system  by  loading  specific  BIT  commands  into  the 
EDC  to  interrupt  and  halt  the  CPU.  CPU  internal  states  can  then  be  loaded  with  a  specific  test  pattern  using  a 
pre-defined  scan  path,  and  specific  diagnostic  routines  can  be  invoked.  Resulting  states  can  then  be  accessed 
and  analyzed  for  fault  isolation  purposes. 

To  minimize  pin  outs  and  the  basic  CPU  cycle  time,  the  microprogram  ROMs  are  distributed  onto  the 
individual  substrates  to  provide  the  necessary  control.  One  level  of  microinstruction  prefetching  is  supported, 
allowing  the  fetch  of  the  microinstruction  to  be  overlapped  with  its  execution.  Recognizing  that  the  micropro¬ 
gram  ROM  is  one  of  the  most  error-prone  and  functionally  critical  areas  in  the  CPU,  single-bit  error  detection 
and  double-bit  error  correction  is  concurrently  provided  with  microinstruction  execution.  Error  detection  is 
performed  on  the  microcode  as  it  is  executed.  If  no  error  is  detected,  there  is  no  performance  penalty.  If  an  error 
is  detected,  the  execution  will  be  cancelled  by  inhibiting  the  result  clock,  so  that  no  machine  states  are 
changed.  The  CPU  will  then  be  suspended  for  a  cycle  in  order  for  the  microcode  to  be  corrected  ( if  correctable  •. 
or  re-fetched  and  re-checked  (if  not  correctable),  and  execution  resumes  in  the  following  cycle.  If  the  error  in 
the  microcode  repeats  after  refetch,  the  CPU  is  halted,  and  control  is  turned  over  to  the  BIT  processor 

CONCLUSION 


A  sophisticated  ISA  like  Nebula  provides  a  software  programmer  with  numerous  flexible  features  that  are 
convenient  and  powerful  to  use.  However,  the  complexity,  reliability,  and  cost  of  such  implementations  are  of 
primary  concern. 

The  system  architecture  described  in  this  paper  provides  a  3  MIP  implementation  of  the  Nebula  ISA  using 
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generally  applicable  for  high-speed  processor  designs. 
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Introduction 

Digit  serial  data  transmission  can  be  used  to  an  advantage  in  the  design  of  spe¬ 
cial  purpose  processors  where  communication  issues  dominate  and  where  digit  pipelining 
can  be  used  to  maintain  high  data  rates  [DeFt].  VLSI  signal  processing  applications  are 
one  such  problem  domain.  We  have  developed  a  family  of  VLSI  components  which  have 
digit  serial  transmission  and  which  can  be  pipelined  at  the  digit  level.  These  com¬ 
ponents  can  be  used  to  construct  VLSI  processors  which  are  especially  suited  to  signal 
processing  applications.  One  particularly  attractive  such  processor  is  a  structure  we  call 
the  arithmetic  cube  [Owl],  The  arithmetic  cube  can  be  programmed  to  solve  linear 
transformations  such  as  convolutions  and  DFTs  ,  has  nearest  neighbor  interconnects, 
regular  layout,  simple  control,  and  a  limited  number  of  interconnections.  Regular  lay¬ 
out  and  simple  control  derive  naturally  from  the  algorithms  on  which  the  processor  is 
based.  Long  wires  are  eliminated  by  the  nearest  neighbor  interconnect.  High 
throughput  can  be  achieved  by  pipelining  the  processor  at  the  digit  level.  The  arithmet¬ 
ic  cube  is  programmable  in  the  problem  size  n  ;  once  implemented  for  a  certain  size  N , 
smaller  problems  can  be  solved  on  the  same  implementation  without  a  loss  in  perfor¬ 
mance.  In  addition,  the  architecture  extends  to  larger  N  in  a  regular  and  automatic 
fashion. 


Table  1  lists  the  system  conventions  to  which  our  VLSI  components  conform 
[DeR].  The  arithmetic  cube  contains  an  arithmetic  processor,  which  is  built  out  of 
adder  and  multiplier  components,  fed  by  a  memory. 


Convention  1: 
Convention  2: 
Convention  3: 

Convention  4: 

Convention  5: 
Convention  6: 
Convention  7: 
Convention  8: 
Convention  9: 
Convention  10 
Convention  11 
Convention  12 
Convention  13 


Table  1.  System  Conventions _ 

communication  is  base  4  digit  serial 

digit  pipelined,  msd  first 

high  signal  =  logical  1 

low  signal  =  logical  0 

on-chip  data  stable  during  phi-2 

off-chip  data  stable  during  phi-1 

numerical  format  is  fixed  point  fraction 

operand  word  length  is  fixed  and  constant 

multiple  precision  values  are  not  allowed 

components  have  fixed  and  bounded  latency 

inputs  to  components  are  time  aligned 

two  broadcast  control  signals  are  allowed 

one  control  signal  may  accompany  the  data 

primitive  components  are  fine  grain  in  size 

intercomponent  signal  wires  abut  or  river  route 


The  data  communication  format  is  digit  serial,  where  a  digit  is  represented  in 
base  4  signed-digit  format  [Atkj.  The  operand  digit  set  is  the  maximally  redundant 
symmetric  signed-digit  set  for  base  4  ({-3,  -2,  -1,  0,  1,  2,  3})  requiring  three  wires  to 
transmit  one  operand  digit,  in  a  two’s  complement  encoded  format.  All  components  are 
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digit  pipelined  with  up  to  three  digit  operands  input  and  three  digit  results  output.  The 
data  communication  flow  proceeds  in  a  right  directed  fashion  (from  most  significant  digit 
(msd)  to  least  significant  digit  (lsd)).  Operand  digits  are  input  one  set  per  clock  cycle. 
The  first  result  digit  is  generated  and  output  some  number  of  clock  cycles  after  the  in¬ 
put  of  the  first  set  of  operand  digits.  After  the  first  result  digit  is  output,  a  result  digit 
is  output  for  each  subsequent  cycle.  This  latency  between  the  when  the  first  digit  of 
data  is  input  and  when  the  first  digit  of  the  output  is  generated  is  measured  as  an  in¬ 
teger  number  of  digits.  We  require  that  all  components  have  a  fixed,  known  latency. 
Ideally,  this  latency  should  be  one  although  this  may  not  always  be  possible.  Our  primi¬ 
tive  components  can  be  classified  as  fine  grain  ;  that  is,  each  component  is  limited  in 
size  so  that  the  many  components  making  up  a  processor  can  fit  on  a  single  (or  a  very 
few)  chip(s).  Each  component  can  contain  no  more  logic  (or  area)  than  that  required  for 
a  one  digit  product  cell,  two  digit  adder  cells,  a  shift  register,  some  small  local  control, 
and  local  interconnect. 

Ideally  for  VLSI,  processors  should  be  meshes  or  linear  arrays  of  primitive  com¬ 
ponents,  interconnected  in  a  nearest  neighbor  fashion  so  that  the  intercomponent  inter¬ 
connects  automatically  abut.  The  arithmetic  cube,  shown  in  Figure  1,  maintains  the 
nearest  neighbor  interconnect  while  retaining  flexibility.  The  cube  contains  an  adder  ar¬ 
ray  with  N  adder  components  (A,  , ’s)  and  a  multiplier  vector  with  Vlv  multipliers 
(P,  ’s)  which  feeds  a  memory  array  with  \f~N  memories  (M,  's)  containing  \f~N~  words 
each.  Figure  1  shows  only  the  digit  wide  data  paths  interconnecting  the  arithmetic 
components  and  the  memories. 

After  discussing  the  two  primitive  arithmetic  components  used  in  the  cube,  we 
will  illustrate  how  the  cube  can  be  used  to  compute  Y  when  '’programmed"  with  H.  B. 
and  X  where  Y  is  defined  by  either 

Y  =  (h©Bx) 

Y  =  (H©Bx)r 

where  B  is  a  predefined  n0Xnt  matrix  whose  elements  are  either  -1,  0,  or  1.  H  is  a 
predefined  n0Xn2  matrix  whose  elements  are  fixed  point  numbers,  X  is  the  n  ,  X  n  in¬ 
put  matrix,  and  Y  is  the  n0Xn2  result  matrix.  We  will  then  show  how  these  opera¬ 
tions  can  be  combined  in  various  ways  to  compute  the  DFT . 
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Figure  1.  The  Arithmetic  Cube 


Cube  Architecture 


The  arithmetic  cube  contains  an  adder  array  with  N  programmable  adder  com¬ 
ponents  (A,  •  ’s),  shown  in  Figure  2,  which  has  been  "folded”  along  the  diagonal  to 
maintain  nearest  neighbor  interconnect  between  the  first  column  of  adders  and  the  mul¬ 
tipliers.  The  Load  signal  loads  the  p  digit  circular  shift  register  r0,  r  „  .  rp_,.  At 

the  end  of  the  Load  operation  rp  holds  the  most  significant  digit  of  the  loaded 
operand. 


Load 

Mode 


where 


Load  Cycle  (Load 
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Add  Cycle  (Load  =  0): 


xout  xin 


ro  =  rp-i 


r,  =  r,_|  for  i 

=  1, 

2,  •  • 

■  P-1 

Mode  0  (Mode  = 

0): 

Xin 

-4Cj 

'frP- 1 

> 

0 

(1)  Sj  =  z,„  +  « 

0 

if  rp  -l 

= 

0 

~x,n 

-4  Cj 

*/  rP  -1 

< 

0 

where  -1  <  c;-  <1  and  -2  <  Sj  <2 

(2)  =  *j- 1  +  «y 

Mode  1  (Mode  =  1): 

{rp-l"4c>  */  x»n  >0 
0  if  Xin  =  0 

-rp_1-4cJ  */*,•„  <0 

where  -1  <  Cj  <1  and  -2  <  Sj  <2 


(2)  z0ut  =  *j- 1  +  cy 


Figure  2.  Cube  Adder 


In  the  adder,  the  present  sum  digit  s} ,  which  is  determined  by  the  first  digit  ad¬ 
dition  operation  (1),  is  latched  for  use  in  the  next  cycle.  Then  a  second  digit  addition 
operation  (2)  adds  the  present  carry  to  the  previous  sum  and  this  result  is  output.  Most 
importantly  this  second  addition  is  guaranteed  to  be  carry  free  because  of  the  restricted 
digit  sets  allowed  for  the  carry,  c;- ,  and  the  sum,  Sj  .  The  latency  of  this  adder  is  one 
digit.  The  addition  component  requires  two  digit  adder  cells,  a  one  digit  storage  cell  for 
the  present  sum  digit,  and  a  shift  register  for  R  .  Only  one  component  is  needed  regard¬ 
less  of  the  operand  precision.  The  number  of  digits  of  storage  provided  in  the  shift  re¬ 
gister  will  be  the  upper  limit  on  the  precision  of  the  operands. 

Using  just  the  adder  array  and  the  memory,  the  arithmetic  cube  can  be  used  to 
compute  Z  =  B  X.  A  set  of  memories  is  said  to  contain  B  (Br)  if  the  t  'th  memory 
contains  the  t  ’th  row  (column)  of  B.  With  the  memories  M,  .  0  <  i  <  n  ,,  initially 
containing  the  i  ’th  column  of  B,  the  shift  registers  in  each  adder  can  be  loaded  by- 
shifting  the  digits  of  the  elements  of  B  along  the  data  paths  connecting  the  memory  and 


adder  array.  Then  with  the  Mode  line  held  at  0,  the  digits  of  the  elements  of  X  are  cy¬ 
cled  through  the  adder  array  with  the  digits  of  the  elements  of  Z  being  collected  back 
into  the  memories.  The  time  needed  is  0{nQ+n  j+n  «)  independent  of  N.  In  a  similar 
fashion,  the  cube  can  be  used  to  compute  Z  =  I  B  X  J  T  .  If  the  memories  initially  con¬ 
tain  Br  and  X,  ZT  can  be  computed  by  transferring  the  digits  of  the  elements  of  X 
from  the  memories  to  the  adder  array  during  the  Load  cycle  to  load  the  shift  registers  of 
each  adder,  and  then  transferring,  while  the  Mode  line  is  held  at  1,  the  digits  of  the  ele¬ 
ments  of  B  from  the  memories  to  the  adder  array  and  at  the  same  time  transferring  the 
digits  of  the  elements  of  Z  from  the  adder  array  to  the  memories.  The  time  needed  is 
(^no+rij+n 2)-  Which  Qf  the  two  operations  is  performed  by  the  adder  array  is  deter¬ 
mined  by  the  order  in  which  the  operands  are  supplied  and  the  Mode  control  line. 

The  cube  also  contains  a  multiplier  vector  with  \ZW  multipliers  (P,  ’s).  Our 
semi-systolic,  programmable  multiplier  is  built  out  of  S  components  each  of  which  con¬ 
tains  a  digit  product  cell  and  two  digit  adder  cells.  To  multiply  two  p  digit  values, 
p  4-  3  S  components  are  interconnected  as  illustrated  in  Figure  3  where  p  =  4.  The  S 
components  are  loaded  during  the  Load  operation  with  the  multiplier, 
Y  =  (.y  xy2y  32/4)4*  which  has  been  recoded  [Atk]  so  that  y,  e  {-2,  -1,  0,  1,  2}.  During 
processing  the  digits  of  the  multiplicand,  msd  first,  are  broadcast  on  the  xin  line  to  the 
S  components  in  the  multiplier.  All  other  data  is  passed  from  component  to  neighbor¬ 
ing  component  in  a  systolic  fashion.  The  latency  of  the  multiplier  is  two. 
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Load  Cycle  (Load  =  1): 
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Multiply  Cycle  (Load  =  0): 
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(2)  Sj  +  c,n 

Figure  3.  Cube  Multiplier 

Using  just  the  multiplier  vector  and  the  memory,  the  arithmetic  cube  can  be  used 
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to  compute  Y  =  H®Z.  If  the  memories  Mj,  0  <  i  <  n0,  initially  contain  the  / 't h 
rows  of  Z  and  H,  then  Y  can  be  computed  by  cycling  the  data  values  through  the  mul¬ 
tiplier  vector.  The  time  needed  is  0(no  4-  n2). 

If  the  memories  initially  contain  H,  Br ,  and  X,  then 

(1)  Y  =  (H®BX) 

can  be  computed  by  transferring  the  elements  of  B  from  the  memories  to  the  cube  to 
load  the  adders  and  then  transferring  the  elements  of  H  and  X  from  the  memories  to 
the  cube  and  at  the  same  time  transferring  the  elements  of  Z  from  the  cube  to  the 
memories.  In  a  similar  fashion,  if  the  memories  initially  contain  Hr  .  ,  and  X  then 

(2)  Y  =  (Hr  ®(BX)r  )  =(H®BX)r 

can  be  computed.  If  H  L-  y  =  D  ,  ,• ,  0  <  i  ,  j  <  n  ,  where  D  is  a  diagonal  ma¬ 
trix,  then 

(3)  Y  =  (H®BX)  =  (dBx) 

is  a  special  case  of  (l).  If  |h  j,  =  1,  0  <  i  <  n  ,  and  0  otherwise,  then 

(4)  Y  =  (Hr  ®  (B  X)  t  )  =  (B  X)  t 

is  a  special  case  of  (2).  The  time  needed  to  compute  any  of  these  four  operations  is 
O(n0+n  !+n2)  independent  of  N,  the  size  of  the  arithmetic  cube.  Hence,  there  is  no 
penalty  in  solving  a  small  problem  on  a  large  arithmetic  cube. 

Cube  Algorithms 
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We  will  now  describe  how  these  four  operations  can  be  usee  ^,to  compute  the 

DFT .  The  discrete  Fourier  transform  y  =  y0,  y  •  •  •  ,  yn_x  of  n  points 

I T  L  -> 

xn_!  ,  DFT  [n  ),  is  given  by  y  =  W  x,  where  W  is  a  n  X  n 

matrix  whose  elements  are  defined  by  j,  y  =  ,  0  <  i ,  j  <  n  ,  and  u>  is  the 

n  ’th  root  of  unity.  The  discrete  Fourier  transform  can  be  computed  using  small  n  al¬ 
gorithms  and  their  formulations  by  Good  and  Winograd  [AgC,  CoT,  E1R,  Goo,  KoP. 
Wil,  Wi2].  The  original  motivation  behind  the  small  n  algorithms  was  to  develop  algo¬ 
rithms  to  compute  either  the  Fourier  transform  or  cyclic  convolution  which  use  as  few 
multiplications  as  possible.  This  reduction  in  the  number  of  required  multiplications  is 
at  the  expense  of  apparently  unstructured  sets  of  additions  and  subtractions.  Since 
multipliers  are  usually  much  larger  than  adders,  it  is  desirable  to  minimize  the  number 
of  multipliers  even  at  the  expense  of  possibly  increasing  the  number  of  adders.  For¬ 
tunately,  the  apparently  unstructured  sets  of  additions  and  subtractions  do  have  a 
structure  which  maps  in  a  straightforward  manner  onto  the  cube  [Owl].  These  algo¬ 
rithms  are  based  on  the  Chinese  Remainder  Theorem  for  polynomials  and.  when  data  is 
real,  do  not  use  sines,  cosines,  or  complex  arithmetic. 
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The  small  n  algorithms  consist  of  three  steps:  a  set  of  input 
additions/subtractions;  a  set  of  scalings;  and  a  set  of  output  additions/subtractions 
represented  by  three  matrices  S,  C,  and  T  such  that  y  =  SCTx  where  T  is  a 
predefined  6  X  n  matrix  whose  elements  are  either  0,  1,  or  -1  representing  the  input 
additions/subtractions,  C  is  a  predefined  6  X  <5  diagonal  matrix  representing  the  6  mul¬ 
tiplications,  and  S  is  an  predefined  n  X  <5  matrix  whose  elements  are  either  0,  1,  or  -1 
representing  the  output  additions/subtractions.  Optimal  small  n  algorithms  have  been 
derived  for  n  =  2,  3,  4,  5,  7,  8,  9,  16. 


Small  n  algorithms  become  impractical  for  large  values  of  n  .  For  larger  n  the 
DFT  (n  )  can  be  computed  by  combining  an  appropriate  set  of  small  n  algorithms. 
Three  reductions  for  computing  the  DFT  (n  )  will  be  considered.  In  the  following  discus¬ 
sion  (Sni,  Cni,  Tni)  represent  the  small  n  algorithm  for  DFT^iii),  0  <  i  <  /  where 
n  =  n0  «]  •  nj_j  such  that  (usually)  the  n,  ’s  are  relatively  prime. 

A)  The  Good’s  ( prime  factor)  algorithm  [Goo]: 

S"i 

where  X  and  Y  are  n0Xn!  matrices  holding  the  input  and  the  output  vectors.  Both 
n  0  and  n  L  must  be  relatively  prime.  Good’s  algorithm  can  be  reformulated  as 

Y  =  (B1(D2B2(B3(D4B4x))r  )  )T 
where 


Y  =  (sn°  Cn°  T"°( 


B1  =  S"°,  D2  =  Cn°,  B2  =  Tn°,  B3  =  S"',  D4  =  Cn\  B4  =  T"1. 

In  this  form,  Good’s  algorithm  uses  the  series  of  four  cube  operations:  3  — ►  4  — >  3  — 1 >  4. 
B)  The  Winograd  algorithm  [Wil,  Wi2]: 

Y  =  Sn‘(sno  C©Tn°(Tn‘  x)T  )T 

where  ©is  element-wise  multiplication  and  C  is  the  constant  n0  X  n  ,  matrix 

[c]„  =  [c"“],  ,  [c"']J  :,0  <  ,  <  n„.0  <  j  < 

Both  n0  and  n  t  must  be  relatively  prime.  Winograd’s  algorithm  can  be  reformulated  as 

Y  =  (b1(b2(h3©B3(b4  x)7  j)7  ) 

where 


B2  =  S"°,  B3  =  Tn°,  H3  =  C,  B1  =  Snh  B4  =  Tn‘ 

In  this  form,  Winograd’s  algorithm  uses  the  series  of  four  cube  operations: 

4  — *  !  — *  4  — *•  3. 
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C)  The  mixed  radix  (FFT-like)  algorithm:  [OwJ] 

Y  =  (sn°  Cn°  Tn°(w©Sn‘  Cn'  Tn‘  X)r)r 

where  W  is  the  constant  n0  X  n  l  matrix 

[w  ji  ;  =  ,  0  <  i  <  n0,  0  <  j  <  n  j. 

In  this  case  n0  and  nl  need  not  be  relatively  prime.  The  mixed  radix  algorithm  can  be 
reformulated  as 

Y  =  (b1(d2B2(h3©B3(d4B4x)  )T  )  )T 

where 

B1  =  S1,  D2  =  C1,  B2  =  T1,  B3  =  S2,  D4  =  C2,  B4  —  T2.  H3  =  W 

In  this  form,  the  mixed  radix  algorithm  uses  the  series  of  four  cube  operations: 
3  —  2  — *  3  — *  4. 

Conclusions 

The  decomposition  of  interest  for  the  DFT  (n  )  is  that  where  a  (n0,  n  ,)  for  n  ex¬ 
ists  such  that  n  q,  n  x  =  0(\ZtT  )  referred  to  as  an  optimal  decomposition.  In  this  case, 
the  time  needed  to  compute  each  of  the  four  cube  operations  reduces  to  0{sfn  )  When 
an  optimal  decomposition  is  used,  the  time  needed  to  compute  DFT  (n  )  is  0(  \fn  )  For 
example  the  time  needed  to  compute  DFT  ( 210)  using  the  decomposition  (14.  15)  is 
0(29)  while  the  time  needed  using  the  decomposition  (3.  70)  is  0(73). 

For  a  given  precision,  each  adder  and  multiplier  have  constant  area.  Hence,  a 
VLSI  implementation  of  the  arithmetic  cube  will  have  area  O(.V).  For  an  optimal 
decomposition,  the  area  actually  used  is  0(n  )  giving  an  .1  7’ 2  bounds  for  the  arithmet¬ 
ic  cube  of  0(n2).  For  the  VLSI  models  where  is  the  time  needed  to  move  a  given  atom¬ 
ic  piece  of  data  distance  d  is  bounded  by  Fl{d  )  or  where  the  amount  of  data  which  may 
cross  a  boundary  of  length  l  is  bounded  by  0{l ),  the  A  T 2  bounds  to  compute 
DFT  ( n  )  is  Q(n2).  Within  the  framework  of  this  model,  the  arithmetic  cube  is  optimal. 

At  present  the  S  component  was  designed  in  CMOS  as  part  of  the  MOSIS  1985 
summer  VLSI  course.  The  cells  of  the  adder  are  a  generalization  of  a  earlier  NMOS  pro¬ 
ject.  Preliminary  results  indicate  that  the  arithmetic  cube  for  a  reasonably  large  V  (61 ) 
will  occupy  a  reasonably  small  number  of  chips  (16).  Such  a  cube  will  be  able  to  per¬ 
form  DFT  (n  )  for  n  <  (V2.  The  arithmetic  cube  can  also  be  programmed  to  compute 
the  cyclic  convolution  and  other  linear  transformations. 
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1.  INTRODUCTION 

It  is  axiomatic  that  to  support  the  design  of  current  and  future  defense 
systems,  ultra  high-speed  digital  computing  machines  will  be  needed. 

Technologically,  there  are  essentially  two  ways  to  build  faster  digital 

systems.  One  way  is  to  use  faster  electronics  and  the  second  is  architecture 
(the  art  of  integrating  together  functional  subsystems).  In  either  case,  one 
normally  begins  the  "top  down"  design  process  with  a  knowledge  of  whether  a 
general  purpose  or  dedicated  machine  best  satisfies  the  design  goals. 
University  of  Florida  scholars  have  a  long  history  of  researching  these  issues 
through  the  vehicle  of  advanced  arithmetic  unit  design,  fast  algorithm 
development  and  integrating  these  elements  together  into  highly  parallel 

systems  in  an  optimal  manner.  These  individual  discoveries  have  been 

converging  to  the  integrated  design  of  high-performance,  high-speed 

information  and  digital  signal  processing  systems  for  use  in  satellite 

communications,  emitter  identification,  radar  systems,  plus  many  others. 
These  applications  are  characterized  by  single  unit  (ALU)  with  a  real-time 
multiply/accumulate  rates  ranging  upwards  from  100  M  operations  per  second. 
In  this  paper  two  new  architectures  for  arithmetic  intensive  applicatlns  are 
proposed.  They  are  theoretically  based  on 

•  modular  arithmetic  (viz:  residue) 

•  homomorphic  arithmetic  (viz:  logarithmic) 

We  also  realize  that  speed  can  not  be  purchased  at  the  expense  of  cost,  power 
requirements,  or  reliability.  This  is  especially  true  in  high-volume 
applications  and/or  those  cases  where  limited  space/power  and  field 
maintenance  can  be  expected.  As  a  result,  Florida  research  has  been  paying 
much  closer  attention  to  these  companion  issues  and,  it  is  felt,  made  steady 
progress  in  this  direction  during  the  past  year. 


high-vol ume 
and  field 


2.  TECHNICAL  BACKGROUND  (RNS 


Recently,  in  major  tutorial  paper  on  the  subject  of  the  residue  number 
system  or  (RNS)  [Tay84],  it  was  established  that  this  ancient  branch  of 
mathematics  has  enjoyed  a  renaissance  over  the  last  half-decade  motivated  by 
the  promise  of  speed.  The  RNS  is  specified  by  a  moduli  set  of  L  relatively 
prime  integers,  say  P  =  {p, ,  ....  p.J,  GCD  (p^p.)  =  1  if  i  *  j  .  A 
number  in  this  system  is  expressed  as  a  L-tuple  of J integers  where  the  ith 
element  satisfies  X.  =  X  mod  p^  .  In  the  RNS,  the  sum-difference  and  product 
of  two  long  wordlength  integers  can  be  computed  as  a  set  of  L-concurrent  fast 
short  wordlength  carry-free  tasks. 

The  RNS  has  often  been  suggested  as  a  promising  media  in  which  to  design 
very  high-performance  real-time  (100  MHz)  special  purpose  signal  and  image 
processing  systems.  The  scenario  would  read  as  fol 1 ows : 
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•  Convert  a  real  signal  into  low  wordlength  (typ  6-12  bjt)  data 
sets  using  a  "flash  A/D  converter"  running  at  a  10°  to  10° 
sample  rate. 

•  Map  the  A/D  converted  n-bit  samples  into  a  RNS  L-tuple  using 
direct  table  lookup  conversions.  That  is,  if  logoCp^)  <_  m,  m 
sufficiently  small  (m  <  6  bits),  use  a  2m  x  m-bit  ROM  tc  map  a 
single  X  into  X^  =  Xmodp^  as  a  table  lookup  task. 

•  Perform  arithmetic,  restricted  to  add,  subtract,  or  multiply, 
using  direct  table  lookups.  That  is,  send  the  two  m  bit 
operands  Xj  and  to  a  22m  x  m  bit  table  which  contains  the 
precomputed  value  of  =  (X^Y^)modp^,  $  =  +,  -,  or  x. 

•  Convert  an  RNS  database  back  into  an  integer  in  order  to 
service  division  related  operations  or  to  output  results. 

Historically,  two  routines  have  been  used  to  convert  a  RNS  L-tuple  into 
an  integer.  One  is  known  as  the  Chinese  Remainder  Theorem  (CRT)  and  the  other 
Is  the  mixed  radix  conversion  (MRC)  algorithm.  They  map  a  residue  L-tuple 
into  an  integer  by  using  a  nested  sum  of  modular  partial  products.  However, 
the  bane  of  the  RNS  has  traditionally  been  the  inability  to  support  efficient 
high-speed  residue  to  decimal  conversion  or  flfOC) .  Unfortunately,  Tt  TT 
fundamentally  important  operation  and  Ts  found  Tn  magnitude  compare,  sign 
detection,  and  overflow  management  operations  and  must  be  dealt  with. 

2.0  THEORY  OF  COMPLEX  RNS  ARITHMETIC 

To  a  small  segment  fo  the  RNS  community,  a  new  wave  has  been  gaining 

momentum  in  the  area  of  the  complex  RNj>.  Within  the  framework  of  a  finite 

ring,  one  can  consider  the  roots  of  x^  =  -1  mod  p  .  If  x  Z  ,  then  x  is 
said  to  be  a  non-qucdratic  root  or  imaginary  [Her76],  Until  recently,  complex 
arithmetic  was  based  on  these  numbers  and  emulated  the  classic  rules  for 
complex  arithmetic  which  required  2  real  adds  per  add  and  4  real  multiplies 
plus  2  real  adds  per  multiply.  This  is  called  the  Complex  RNS  or  CRNS.  Now 
consider  the  use  of  Gaussian  primes  as  moduli  which  are  of  the  form  pi  =  4n  + 
1.  When  the  congruence  xz  =  -1  mod  p  has  an  integer  solution  (a  quadratic 
root)  such  that  j  =  -1  mod  p  and  j  e  Z  then  j  is  a  real  number.  It  has 

been  shown  that  there  is  an  isomorphic  mapping  of  a  compl ex  number  z  =  a  +  ib 

into  two  new  integers  $  (a,  b)  and  <fr  (a,  -b)  under  <fr(a  +  ib)  ♦  (a  +  jb) 
mod  p  such  that 

i.  (a,  b)  +  (c,  d)  =  ((a  +  c)modM,  (b  +  d)modM)  (1) 

ii.  (a,  b)  *  (c,  d)  =  ((ac)modM,  (bd)modM) 

The  importance  of  the  result  is  that  only  two  real  multiplications  and  no  adds 
are  required  to  do  complex  multiply!  We  have  also  proven  that  tdie  power 
operation,  such  as  the  pervasive  magnitude  squared  mapping  zz*  =  |z|  ,  can  be 

completed  In  one  real  multiply  versus  two  multi  pi  ies  plus  an  add  as  is 

currently  required  in  this  system  which  is  called  the  quadratic  RNS  or  QRNS. 

Researchers  at  the  University  of  Florida  have  achieved  several  milestones 
in  this  area  and  they  are 

•  Extensions  of  the  theory  of  the  complex  residue  number  system 
based  on  rings  of  Gaussian  integers  which  include: 


i.  that  the  reported  QRNS  isomorphism  does  achieve 

Winograd's  lower  bound  for  multiplication  complexity 
[Tay86i  ] . 

ii.  the  mathematical  machinery  to  synthesize  (i.e.,  derive 
other  than  divine)  the  QRNS  isomorphism  as  well  as  Wgher 
order  isomorphisms  (i.e.,  mappings  w.r.t.  p(x)  =  x^  +  1 
(QRNS),  x3  +  1,  ...,  x*  +  1)  [Tay86i ]. 

•  Implementation  of  practical  complex  RNS  units  based  on  "off- 
the-shelf"  hardware  [Tay85i],  This  is  suggested  by  the 
implementation  breakthrough  called  the  single  modulus  QRNS 
[Tay85i].  Based  on  the  primitive  quadratic  roots  of  2n  +  1, 
for  n  =  2,  4,  8,  16,  32,  we  can  now  design  complex  arithmetic 
units  which  we  have  theoretical ly  shown  (based  on  a  gate  level 
analysis)  to  be  potentially  100%  faster  than  conventional 
(non-residue)  units  in  essentially  the  same  amount  of  hardware 
(Figure  1). 

To  test  this  hypothesis,  the  gate  array  design  of  a  SM 
unit  was  undertaken  using  the  GE  C20000  CMOS  technology 
[Tay86vi].  The  design  consisted  of  the  following  on-chip 
modules : 


NEG 

SUM 

MDL 

MUL 


Negates  modulo  p  MUTT: 

Carry  lookahead  adder  JX: 

Sum  modulo  p  ITWO: 

Carry-save  unsigned  multiplier  ITWOJ: 


Product  modulo  p 
Scale  by  j 
Scale  by  2"1 
Scale  by  (2j  )-1 


where  JX,  ITWO,  and  ITWOJ  tasks  are  essentially  radix-2  binary 
shift  operations.  For  n  =  4,  8,  and  16,  execution  latencies 
were  simulated  using  TEGAS  which  demonstrated  a  speed 
advantage  for  the  CRNS.  In  particular,  QRNS(min)/CRNS(mi n)  = 
2.08,  QRNS(max)./CRNS(max)  =  1.32  which  was  found  to  be  in 
support  of  the  theoretical  predictions. 

•  Study  of  the  OFT  using  recently  developed  RNS  theory 
[Tay851i],  Based  on  a  gate-level  design  study,  a  conventional 
16-bit  radix-4  FFT  using  fast  multipliers  and  CLAs  was  found 
to  essentially  require  the  same  amount  of  hardware  at  the 
butterfly  level.  If  two  multipliers  are  committed  to  a 
design,  the  latency  became  35u  (u  =  gate  level  delay)  for  the 
conventional  unit  versus  21u  for  the  SM-QRNS! 


3.  TECHNICAL  BACKGROUND  (LNS) 

The  logarithmic  number  system  has  been  only  recently  studied  seriously 
[Swa75,Lee77,Swa83,Tay85i i ].  A  number  in  this  system  is  approximated  by 


x  =  +  r  x;  ex  =  [I  bits:  F  bits]  (2) 

where  arithmetic  is  performed  by  manipulating  exponents,  in  particular 
i.  MULT/DIV  C  =  AB  or  A/B  e  =  e  +  e. 

C  a  D 


(3) 


ii.  ADD/ SUB  C  =  A±B 


It  can  readily  be  seen  that  multiplicative  mapping  (i  &  iii)  are  trivially 
implemented.  The  ADD/SUB  operation,  denoted  $(v),  is  not  derived  but 
mechanized  as  a  ''fast'1  table  lookup  mapping.  As  a  result,  the  precision  of 
the  LNS  is  historically  limited  to  the  address  space  spanned  by  v.  For  a 
high-speed  (7-30  ns)  lookup  latency,  this  translated  to  typically  a  12-bit 
range.  Recently,  Florida  researchers  have  used  various  data  comparison 
schemes  to  extend  this  to  20  bits  using  a  ISL  technology.  The  62  Kmil^  390  mW 
6-function  processor  chip,  with  6-word  stack  (PUSH/POP,  TOP  and  NEXT  TO  TOP), 
possessed  the  following  attributes  (Figure  2): 

MULT/DIV  40  ns  Range  max  (1.84  x  101;! 

ADD/SUB  92  ns  min  (5.42  x  10~zo) 


SQR/SQRT  20  ns 


Preci sion 


2-12.52 


4.  INTEGRATION 

The  processors  detailed  in  the  previous  section  are  fast  and  possess  a 
regular  dataflow.  As  a  result,  they  are  likely  candidates  for  use  as 
"systolic  primitives."  The  technical  approach  to  integration  to  be  taken  is 
based  on  some  recent  published  work  of  Kung,  Moldovan,  and  Fortes  [Kun,R?, 
Mol 86,1^85].  This  work  describes  a  method  by  which  an  algorithm,  described 
by  a  difference  equation,  can  be  maped  onto  the  physical  rectangular  array. 
These  techniques  can  be  used  to  map  algorithms  into  systolic  or  wavefront 
arrays.  The  algorithm  to  array  mapping  is  given  in  terms  of  an  epimorphism 
(we  do  not  believe  it  is  an  isomorphism)  matrix  T  which  operates  on  a  matrix 
of  switching  prmitives.  Fortes,  for  example,  has  used  this  concept  to  develop 
a  gracefully  degraded  architecture  [For85],  Here,  if  in  an  nxn  array  the 
(i  ,j)  processor  fails,  the  ith  row  or  jth  column  (or  both)  would  be  taken  "off 
line."  This  meant  that  to  achieve  continuous  array  processing,  at  least  (n-1) 
perfectly  functional  units  were  removed  to  protect  the  system  from  an  isolated 
fault.  Even  though  this  may  appear  to  be  "radical  surgery,"  we  feel  that  the 
mathematical  machinery  and  the  refreshing  conceptualization  of  the  problem 
offered  by  Fortes  and  others,  points  us  in  a  very  promising  direction. 

We  have  developed  a  method,  using  incremental  difference  equations,  of 
predicting  the  "slope"  of  these  wavefronts.  We  also  feel  that  we  can  change 
the  direction  of  a  wavefront  through  a  dynamic  (during  real-time)  redefinition 
of  T.  We  call  this  wavefront  steering.  We  submit  that  through  the  proper 
application  of  this  concept  we  can  simultaneously  achieve  both  fault  tolerance 
and  load  balancing.  This  is  a  bold  statement  when  it  is  recalled  that  it  was 
previously  stated  that  these  are  generally  orthogonal  concepts.  However,  with 
the  steerable  wavefront  concept  can  achieve  the  following: 
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1.  Move  the  existing  tasks  aside. 

2.  Install  the  interrupting  task. 

3.  Restore  the  original  tasks. 

Load  balancing  can  be  achieve  using  one  of  the  following  procedures, 
i.  After  initiating  the  interrupting  task,  use  the  remaining 
array  space  to  restart  some  of  the  interrupted  tasks,  or 
ii.  Do  not  drive  all  of  the  interrupted  task  to  the  boundary. 

Exist  only  those  required  to  open  up  sufficient  space  to  begin 
execution  of  the  interrupted  program. 

5.  SUMMARY 

Our  residue  and  logarithmic  arithmetic  research  at  the  University  of 
Florida  has  recently  overcome  several  of  the  problems  which  has  limited  its 
prior  use  and  acceptance.  These  new  findings  are  now  being  extended  into  new 
areas  of  complex  arithmetic,  VHSIC/VLSI  and,  integration  into  a  viable, 
hlghly-parallel  numeric  Intensive  processor  designs  and  applications.  We  will 
also  plan  to  develop  the  mathematical  machinery  which  will  allow  the 
developed  computational  primitives  to  be  interconnected  into  a  rectangular 
array.  This  study  will  provide  a  common  framework  in  which  both  fault 
tolerant  and  load  balanced  arrays  can  be  designed  and  studied.  The  proposed 
research  in  this  area  will  be  a  synergism  of  theory,  design,  architecture,  and 
integration.  The  end  product  of  this  research  will  be  the  capability  of 
designing  high-performance  communications,  control,  and  image  processing 
systems  which  are  necessary  to  meet  many  future  mission  objectives. 

University  of  Florida 
1986 
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HARDWARE  IMPLEMENTATION  OF  SPECIAL  FUNCTIONS 
AN  APPLICATION  FOR  COMBAT  SIMULATION 
by 
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McLean,  Virginia  22102 

ABSTRACT 

This  paper  outlines  an  approach  to  improving  computer  performance  based 
on  reducing  key  algorithms  to  hardware.  This  takes  advantage  of  very  fine 
grain  parallelism  which  would  require  either  complex  control  or  unacceptable 
resource  management  overhead  to  handle  otherwise.  As  computer  aided  design 
and  fabrication  become  more  convenient  and  faster,  such  reduction  to  hardware 
should  become  as  much  a  programming  option  as  is  current  use  of  firmware. 

The  effort  outlined  is  to  explore  the  means  of  doing  this  by  reducing  a 
well  defined  algorithm  to  a  hardware  unit  that  can  be  plugged  into  the 
existing  system  used  to  execute  the  program.  The  algorithm  chosen,  hexagonal 
coordinate  system  addition,  is  an  important  calculation  in  the  CORBAN 
simulation  and  similar  software  of  importance  to  AirLand  Battle  Management. 
This  algorithm's  speedup  of  two  orders  of  magnitude  will  allow  speedup  of  the 
simulation  limited  only  by  the  fact  that  its  use  comprises  a  fraction  of 
simulation  computation.  Widespread  use  of  this  approach  should  result  in 
significant  speedup  when  applied  to  many  component  algorithms. 


A.  BACKGROUND 

Combat  Simulation  is  currently  computational ly  limited  on  existing  machines. 
For  example,  the  CORBAN  (Corps  Battle  Analysis)  Simulation  runs  only  about  as  fast  as 
real  time  on  a  Sun  microcomputer.  Orders  of  magnitude  better  performance  are  needed 
to  meet  the  analytic  applications  requirements  for  Battle  Management. 

There  are  several  approaches  to  improving  performance.  One  is  the  improvement 
of  the  compiler  and  other  system  tools.  This  may  yield  a  factor  of  two  performance 
improvement.  Another  approach  is  to  take  advantage  of  advances  in  the  state  of  the 
art.  The  new  68020  should  allow  a  factor  of  two  or  three  improvements  in  run  time, 
albeit  at  a  significant  increase  in  cost.  Software  structure  changes,  particularly 
unpacking  of  records,  can  trade  space  for  time  to  gain  perhaps  another  factor  of  two. 
Better  algorithms  for  functions  can  give  speedup  as  well.  For  example,  a  range 
calculation  that  currently  involves  several  coordinate  transformations  can  be 
approximated  with  acceptable  accuracy  with  arithmetic  no  more  complicated  than 
addition,  negation,  shifts,  and  one  multiplication. 

Parallelism  may  be  profitably  applied  to  make  dramatic  improvements  in  run  time, 
but  at  a  high  cost  in  hardware.  As  the  inherent  large  grain  parallelism  available  is 
reached,  the  efficiency  of  parallelism  declines,  to  the  point  where  marginal  costs 
are  not  worthwhile.  This  point  is  illustrated  by  previous  work  that  showed,  for  the 
Butterfly  parallel  processor,  a  decline  from  86  percent  efficiency  to  77  percent 
efficiency  when  the  number  of  processors  rises  from  64  to  124  for  a  simple  simulation 
with  a  typical  scenario  size  of  800  units.  (1) 
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Another  way  of  achieving  speedup  is  to  migrate  algorithms  from  software  to 
hardware.  The  effect  of  using  hardware  algorithms  is  to  capture  fine  grain 
parallelism  that  can  be  expressed  in  a  logic  diagram  but  not  in  serial  instructions. 
This  parallelism  is  unlikely  to  be  captured  in  any  other  way,  as  its  fine  grain  would 
impose  unacceptable  management  costs  or  require  a  very  complex  machine.  Use  of  a 
floating  point  processor  in  conventional  computers  is  an  example  of  this.  But  once  a 
fairly  limited  number  of  arithmetic  and  perhaps  transcendental  functions  are  so 
supported  with  hardware,  there  remains  quite  a  bit  of  processing  which  is  not  helped 
by  these  fairly  universal  operations.  It  is  on  these  algorithms  that  this  paper 
focuses. 

The  power  of  special  purpose  hardware  is  illustrated  by  the  recent  US  computer 
chess  championship,  where  the  top  two  programs,  HITECH  and  BEBE,  utilized  special 
hardware  which  was  designed  for  chess  algorithms  and  were  able  to  beat  the  previous 
champion,  CRAY  BLITZ,  which  uses  a  CRAY  XMP48.  (2) 

B.  THE  CANDIDATE  PROBLEM. 

In  the  combat  simulation  CORBAN,  there  are  several  utility  algorithms  that 
consume  a  significant  portion  of  the  processing.  If  many  of  these  can  be  converted 
into  hardware,  significant  speedups  may  be  achieved  in  parts  of  the  processing  that 
will  not  yield  much  to  other  approaches  described  above.  One  such  algorithm  has  been 
selected  to  illustrate  this  approach  and  as  a  candidate  for  a  demonstration. 

The  CORBAN  simulation  uses  a  coordinate  system  based  on  a  hierarchy  of  hexagons. 
(3)  (Hexagons  are  preferred  over  squares,  since  in  an  area  so  subdivided  any  two 
adjacent  regions  must  share  a  side.  This  eliminates  the  corner  problem  found  in 
square  grids.  Also,  ranges  calculated  in  hexes  are  accurate  enough  for  most  modeling 
purposes,  while  this  is  not  true  for  squares.)  Arithmetic  performed  in  the  hexagonal 
coordinate  system  consumes  a  significant  amount  of  computation  time.  As  the  CPU  does 
not  have  facilities  for  such  calculations,  this  must  be  performed  in  software. 

A  utility  often  called  in  CORBAN  is  the  Hex  addition  function,  HXADO.  On  the 
Butterfly  in  C,  calls  to  HXAOD  averaged  about  four  milliseconds  each.  An  assembly 
language  version  on  a  machine  faster  by  a  factor  of  two  required  about  400 
microseconds  per  call  on  a  12^  MHz  68000  with  no  wait  states.  This  improvement  is 
still  much  slower  than  the  two  microseconds  or  less  that  would  be  required  to  perform 
the  calculation  in  hardware.  If  similar  speedups  can  be  achieved  for  a  variety  of 
model  algorithms,  overall  performance  can  be  greatly  improved,  in  a  manner  that  can 
be  used  in  conjunction  with  other  improvements  such  as  parallelism. 

C.  THE  ALGORITHM 


The  logic  of  a  hardware  hex  addition  algorithm  is  illustrated  in  Figure  1.  It 
is  of  a  complexity  of  about  1000  gates,  and  requires  about  one  microsecond  for  gates 
with  4  ns  delay.  Table  1  summarizes  timing  calculations.  This  is  sufficiently  quick 
to  make  this  calculation  no  longer  a  factor  in  the  simulation  run  time.  This 
complexity  and  performance  is  well  within  the  capability  of  what  can  be  fabricated 
using  DARPA's  M0SIS  facilities.  A  chip  fabricated  in  this  manner  can  be  readily 
integrated  into  the  system  as  a  memory  mapped  device,  perhaps  even  inserted  as  a 
substitute  for  memory  in  one  or  two  RAM  sockets. 

This  particular  algorithm  is  one  which  is  especially  suitable  for  this  type  of 
approach  for  three  reasons.  First,  it  has  already  been  put  in  the  fastest  form 
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possible  with  software,  so  it  is  unlikely  to  benefit  from  other  speedup  attempts 
short  of  parallelism.  Second,  it  requires  no  memory  references,  and  thus  does  not 
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Figure  1.  Hex  Adder  Logic 


need  to  take  control  of  the  bus  in  the  manner  that  a  coprocessor  would,  simplifying 
the  logic.  Finally,  design  and  test  of  such  a  chip  is  straightforward  because  the 
algorithm  is  combinational  rather  than  sequential. 

TABLE  1.  HEX  ADDER  LOGIC  DELAY 


Binary  Full  Adder 
Hexagon  Half  Adder  (x3) 
Hexagon  Full  Adder  (x3) 

10  Digit  Hex  Address  Adder 


GATES 

DELAY 

13  Gates 

3 

39  Gates 

9 

117  Gates 

27 

1053  Gates 

243 

1/xs  at  4  ns  gate  delay 


Notes: 

1.  A  more  optimal  design  could  reduce  the  delay  by  adding  more  gates, 

reducing  the  hexagon  full  or  half  adders  to  logic,  thus  eliminating 
the  intermediate  steps. 

Z.  The  first  adder  in  the  10  digit  hex  adder  is  only  a  half  adder;  the 

last  adder  does  not  utilize  the  carry,  so  it  does  not  need  one  of  its 

half  adders.  Therefore,  the  total  complexity  (and  delay)  is  9  times 
that  of  a  full  adder. 
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The  algorithm  is  described  more  fully  in  Reference  4. 

0.  EXTENSIONS 

Table  2  lists  various  algorithms  in  CORBAN  which  consume  significant  computation 
time,  about  three  quarters  of  the  simulation  run  time,  which  would  benefit  from 
hardware  algorithm  approach.  The  several  hexagon  arithmetic  and  related  functions 
alone  comprize  about  a  quarter  of  the  simulation  computation.  Other  algorithms  in 
the  simulation  may  also  have  potential  for  speedup  with  this  approach,  but  have  not 
yet  been  investigated.  Individually,  each  one  might  only  give  a  marginal  speedup, 
but  collectively  the  speedup  may  be  very  significant.  The  hex  invert  (or  negation) 
operation  has  a  complexity  of  only  about  100  gates,  and  could  be  included  on  the  same 
chip  as  the  hex  add.  Hex  rotation  (or  multiplication)  would  also  be  straightforward , 
although  somewhat  more  complex  than  negation.  Conversion  from  hexagonal  to  (x,  y) 
orthogonal  coordinates  has  a  complexity  of  about  4000  gates,  and  is  quite  inefficient 
in  software. 

TABLE  2.  ALGORITHM  COMPUTATION  TIME  IN  CORBAN 


Entire 

simulation,  per  time  step 

347.87  sec 

HXADD 

(Hex  addition) 

24.16  sec 

6.91 

HXINV 

(Hex  negation) 

.10  sec 

— 

HXROT 

(Hex  rotation) 

13.62  sec 

3.95 

GETHX2 

(Hex  tree  traversal ) 

66.70  sec 

19.21 

PERCEV 

(Perception,  including  most  calls 
to  HXADD,  HXROT,  GETHX2 ,  and 

HXINV) 

137.83  sec 

39. 21 

HA2XYL 

(Hex  to  X.T  coordinates) 

4.91  sec 

1 .41 

HXOIST 

(Hex  distances) 

10.75  sec 

3.15 

AZIMTH 

(Azimuth  angle  between  hexes) 

39.26  sec 

11.31 

KILTGT 

(Target  attrition  calculation  including 
most  calls  to  HAZXYL ,  AZIMTH) 

64.85  sec 

18.65 

MOVSEL 

(Movement  selection,  by  tree  search) 

43.55  sec 

12.55 

EVALU8 

(Evaluate  decision  rules) 

7.18  sec 

2.15 

Total 

time  of  those  algorithms  listed  above 

266.93  sec 

76.75 

Total 

time  of  hex  utilities  listed  above, 

92.8  sec 

26.65 

excluding  traversals 


NOTE:  Totals  reflect  impact  of  nested  cells. 


Another  class  of  algorithms  for  which  a  hardware  approach  would  give  speed  up 
concerns  searching.  There  are  several  tree  or  list  data  structures  which  must  be 
searched  based  on  a  variety  of  criteria.  For  example,  one  such  search  traverses  a 
list  of  units  occupying  a- given  hexagon,  makes  random  draws  to  determine  what  is 
perceived,  and  indicates  which  are  detected.  A  hardware  approach  can  give  about  an 
order  of  magnitude  speedup  because  all  memory  references  are  data  accesses  necessary 
for  the  traversal;  none  are  needed  for  the  overhead  of  instructions.  This  type  of 
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device  is  more  complex  than  the  hex  addition  unit,  as  it  must  include  addressing 
logic.  It  must  also  be  designed  to  meet  coprocessor  specifications  for  the  host 

processor.  But  this  type  of  algorithm  makes  up  a  large  proportion  of  the  logic  not 
covered  in  the  previous  case,  so  use  of  this  approach  would  have  a  significant 

benefit.  How  general  in  purpose  a  particular  device  should  be,  a  matter  of  cost  and 
risk  traded  off  against  utility,  will  require  study. 

With  some  modification  to  the  representation  of  the  searched  space,  the  most 
often  executed  portions  of  the  search  algorithm  could  be  moved  onto  a  peripheral  chip 
using  a  small  memory  array  to  represent  occupancy  status  for  various  regions  in  the 
model.  This  would  build  on  the  logic  of  the  hex  adder.  Figure  2  illustrates  the 

logic.  This  approach  captures  most  of  a  major  model  algorithm,  currently  coded  in 

the  "Percev"  subroutine,  without  requiring  use  of  the  coprocessor  approach. 
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Abstract 

The  effective  and  efficient  use  of  the 
memory  hierarchy  of  the  computer  system  is 
one  of  the,  if  not  the  single  most  important 
aspect  of  computer  system  design  and  use 
Cache  memory  performance  is  often  the  limit¬ 
ing  factor  in  CPU  performance  and  cache 
memories  also  serve  to  cut  the  memory  traffic 
in  multiprocessor  systems.  Multiprocessor 
systems  are  also  requiring  advances  in  cache 
architecture  with  respect  to  cache  consistency. 
Similarly,  the  study  of  the  best  means  to 
share  main  memory  is  an  important  research 
topic  Disk  cache  is  becoming  important  for 
performance  in  high  end  computer  systems 
and  is  now  widely  available  commercially; 
there  are  many  related  research  problems 
The  development  of  mass  storage,  especially 
optical  disk,  will  promote  research  in  effective 
algorithms  for  die  management  and  migra¬ 
tion.  In  this  paper,  we  look  at  each  com 
ponent  of  the  memory  hierarchy  and  address 
two  issues  what  are  likely  directions  for 
development,  and  what  are  the  interesting 
rases  rdh  problems 
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1.  Introduction 

The  memory  hierarchy  of  a  computer 
system  includes  almost  all  aspects  of  a  com 
puter  system's  storage,  including  cache 
memory,  main  memory,  disk,  tape,  and  mass 
storage.  The  effective  and  efficient  use  of  the 
memory  hierarchy  is  one  of  the.  if  not  the 
most  important  aspect  of  computer  system 
design  and  uae  This  point  is  illustrated  by 
comments  in  two  recent  papers  see  also 

(M»ti84)): 

From  [Hopk83; 

The  performance  of  the  storage  hierar 
chy  may  be  more  important  than  any 
details  of  the  computational  instruction 
set  This  suggests  an  approach  for  those 
who  want  to  exploit  VLSI  design 
Expend  your  effort  on  the  memory 
hierarchy  not  on  exotic  instructions  " 

From  Mxtedd) 

"The  high  performance  tj  bit  micropro 
censors  introduced  over  the  last  five 
years  have  broken  ground  in  a  new 
market  for  microprocessors  high  perfor 
mance  systems  such  ss  engineering  and 
CAD  workstations  and  even  general 
purpose  mainframe  level  computers 
The  16  bit  microprosessors  generally 
have  pientv  of  computing  power  hut 
suffer  in  '  hese  application-  from  in 
inefficient  use  >f  memory  The  pmo  ip.i 
purpose  of  the  ij  hit  mu  ropro«  e»vir- 
now  reaching  the  market  «  to  iver>  mtie 
this  difficulty  and  »o  provide  effii  ieo’ 
engines  for  high  perftirman.  e  system  * 


"Designing  high-performance 

microprocessor-based  systems  requires 
viewing  the  memory  and  its  buses  as  the 
critical  elements.  DMA,  graphics,  and 
multiple  CPUs  must  all  contend  for  this 
resource,  and  the  key  design  criterion  for 
CPUs  intended  for  this  environment  is 
that  they  provide  high  levels  of  comput¬ 
ing  power  without  hogging  the  bus 

Each  component  of  the  memory  hierar¬ 
chy  is  important  for  a  different  reason  The 
access  time  to  the  cache  is  often  the  critical 
path  in  the  CPU  The  size  and  ease  of  access 
to  main  memory  affects  both  performance, 
through  memory  access  time,  paging,  swap¬ 
ping  and  frequency  of  other  I/O,  and  the  ease 
of  programming,  in  the  ability  to  trade  space 
for  running  time  and  programming  time  The 
efficient  use  of  disk  affects  performance,  the 
sue  of  the  disk  address  space  affects  the  ease 
of  programming,  the  cost  of  the  disk  system  is 
a  major  component  of  the  cost  of  the  overall 
computer  system  and  finally  the  physical 
size  of  the  disks  is  often  a  problem  in  the  com 
puter  installation  Efficient  use  of  tapes  and 
mass  storage  impacts  system  performance, 
and  overall  coat  by  its  effect  on  the  number  of 
disks  required,  operations  cost  and  operations 
errors. 

There  are  some  changes  common  to  all 
components  of  the  memory  hierarchy  First 
all  memories  are  becoming  logically  bigger 
Thu  u  due  to  two  related  trends  memories 
are  becoming  denser  and  are  also  becoming 
cheaper  thus  it  is  both  physically  and 
economically  possible  to  increase  the  memory 
sise  Second  the  increasing  performance  of 
the  processor  is  slso  necessitating  additional 
memory  The  performance  of  cache  and  main 
mam  ones  are  mcreaaing  steadily,  as  are  the 
capacities  of  disks 

Among  the  moat  important  trends  in 
genera  I  computer  system  design  is  one  toward 
multiple  CPU's  and  distributed  systems  This 
is  having  and  will  continue  to  have  an  impor 
’ant  effect  on  memory  hierarchy  design  and 
as  ei plained  below  accounts  for  some  of  the 
most  pressing  problems  in  memory  hierar 
'•hies 

In  the  remainder  if  this  paper  we 
present  an  op  to  date  view  of  problems  dire< 
’ions  and  issue*  in  memory  hierarchies  t h i « 


paper  can  be  considered  to  be  an  update  of 
two  previous  surveys  of  this  topic  by  the 
author  [Smit78e,  85a].  There  will  be  a  section 
each  on  cache  memories,  main  memory,  gap 
filler  memory,  disk,  and  tapes  and  mass 
storage.  Some  discussion  will  also  be  pro¬ 
vided  on  the  topic  of  the  "logical"  view  of 
memory,  as  opposed  to  its  performance  and 
physical  configuration 

2.  Cache  Memories 

Cache  memories  are  used  in  modern 
medium  and  high  speed  CPUs  and  new  high 
end  microprocessors  to  temporarily  hold  those 
portions  of  the  contents  of  main  memory 
which  are  currently  (believed  to  be)  in  use.  A 
thorough  survey  of  cache  memories  appears  in 
[Smit82]  and  has  been  updated  in  [Smit84]; 
we  assume  that  the  reader  is  familiar  with 
cache  memories.  An  extensive  bibliography 
appears  in  [Smit86], 

2.1.  Basic  Issues 

The  two  basic  performance  issues  in 
cache  memories  are  access  time  and  hit  ratio. 

Access  time  is  the  time  to  read  or  wnte 
cache  memory,  when  the  desired  information 
is  cache  resident,  and  hit  ratio  is  the  proba¬ 
bility  of  finding  the  target  of  the  reference  in 
the  cache  Access  time  u»  crucial  because  in 
many  most  or  almost  all  computers,  the  cache 
access  time  is  the  critical  path  m  the  machine 
and  is  the  factor  most  tightly  limiting  the 
cycle  time  and  overall  machine  performance 
The  hit  ratio  is  important  not  only  for  the 
traditional  reason,  that  it  affects  the  average 
accesa  time,  but  also  for  a  relatively  new  rea 
son  memory  or  bus  bandwidth  is  a  critical 
and  limiting  resource  m  multiprocessor  sys 
ferns,  and  the  hit  ratio  directly  affects  memory 
traffic 

2.2.  Multiple  CPU  Systems 

Recent  trends  in  computer  system  tech 
nology  ere  encouraging  the  development  of 
multiple  CPU  system*  There  are  two  reasons 
for  this  tendency  a  The  performance  of 
high  end  CPUs  is  not  keeping  pace  with  the 
demands  for  CPU  power  in  certain  applica 
tions  such  as  modeling  simulation  and 
numerical  computation  b  The  cost 
performance  ratio  is  now  significantly  better 
for  small  machine*  than  large  ones  which 
means  that  it  is  cheaper  to  get  a  given 


"amount”  of  computation  by  combining  many 
small  machines  than  having  one  (or  few) 
large  ones. 

The  basic  and  critical  architecture  and 
hardware  issue  in  multiple  CPU  systems  is 
that  of  resource  sharing.  Such  systems,  in 
particular,  may  share  many  of  the  parts  of  the 
memory  hierarchy,  such  as  main  memory, 
disk  and  mass  storage.  There  are  two  result- 
ing  problems:  (a)  maintaining  consistency  of 
shared  and  modifyable  information,  and  (b> 
avoiding  or  minimizing  queueing  and  arbitra¬ 
tion  delays  in  accessing  the  shared  resources. 
Cache  memories  are  directly  concerned  in 
both  problems. 

211.  Cache  Consistency 

In  the  case  that  processors  have  cache 
memories  and  also  share  main  memory,  the 
problem  is  to  ensure  that  the  many  processors 
see  consistent  values  of  the  shared  data. 
There  are  a  number  of  ways  to  do  that,  none 
of  them  entirely  satisfactory,  a  more  detailed 
discussion  of  these  appears  in  [Smit82,  84a, 
85e,  Swea86];  we  summarize  here.  (1)  The 
cache  can  be  shared;  this  solution  is  usually 
poor,  since  the  cache  doesn’t  have  sufficient 
bandwidth,  and  the  shared  design  increases 
the  access  time.  (2)  AH  writes  by  each  CPU 
can  be  broadcast  to  all  other  CPUs,  and  the 
relevant  lines  either  updated  or  purged;  this 
solution  fails  for  more  than  2  or  4  processors, 
as  write  traffic  begins  to  interfere  with  access 
to  each  cache.  (3)  Directory  methods  (see,  e  g. 
[Arch84]  for  a  recent  study)  maintain  distri¬ 
buted  (in  each  CPU)  or  centralized  (in  main 
memory)  directories,  that  ensure  that  for  each 
line  there  is  only  one  CPU  able  to  access  a 
line  that  has  been  or  is  about  to  be  modified 
(I.e.  many  readers  or  one  writer  )  Directory 
methods  are  expensive  to  implement  and  can 
slow  down  the  caches  and  memory  system  due 
to  the  need  to  synchronize  use  of  writable 
data  (4)  In  the  event  that  an  architecture  is 
new,  requirements  may  be  placed  on  the 
software,  causing  it  to  issue  certain  hardware 
commands  <e.g  cache  purge'  that  will  main¬ 
tain  consistency.  This  solution  is  only  feasi¬ 
ble  if  the  architecture  is  new,  so  that  old 
software  need  not  be  supported,  and  if  the 
appropriate  synchronizing  commands  have 
been  implemented  A  new  way  to  maintain 
consistency  via  software  is  discussed  in 
[Smit85ej,  in  which  "one  time  identifiers'  are 


suggested  to  avoid  unnecessary  cache  purges. 
(5)  A  special  type  of  directory  method,  most 
suitable  for  multiple  microprocessors  sharing 
the  same  bus,  has  recently  been  developed 
and  is  the  preferred  method  for  this  type  of 
system.  (See  [Good83],  [Swea86].»  In  it,  all 
microprocessors  have  logic  to  watch  the  bus 
and  ensure  that  the  1  writer  many  readers 
condition  holds  at  all  times.  The  principal 
limitation  of  this  method  is  that  the  bus  must 
be  shared;  the  method  does  not  extend  to  an 
arbitrary  number  of  processors.  It  also  needs 
to  be  extended  to  work  across  bus  repeaters. 

Each  of  the  above  methods  has  been  or 
is  being  implemented  in  one  or  more  systems, 
but  as  noted,  none  effectively  and  efficiently 
solves  the  general  problem  of  maintaining 
consistency  among  N  processors  sharing 
memory,  where  N  is  reasonably  large  (e.g. 
>16).  Finding  such  a  solution  to  the  cache 
consistency  problem  is  a  difficult,  significant 
and  important  research  problem. 

2.2.2.  Cache  and  Memory  Bandwidth 

The  second  major  problem  with  shared 
memory  is  that  of  memory  bandwidth,  and 
this  problem  is  most  apparent  in  the  case  of 
multiple  microprocessors  sharing  a  single 
memory  bus.  (See  [Bask76]  for  an  analysis  of 
interleaved  memory.)  One  would  like  to  con¬ 
nect  several  microprocessors  to  one  memory 
bus.  Using  a  high  performance  microproces¬ 
sor,  and  a  moderate  performance  bus,  it  is 
possible  to  saturate  the  bus  with  from  1  to  5 
microprocessors;  the  addition  of  more 
microprocessors  brings  no  increase  in  overall 
performance,  as  the  processors  spend  their 
time  waiting  for  memory  access. 

Cache  memories  are  one  of  the  primary 
mechanisms  to  solve  the  memory  bandwidth 
problem.  A  cache  memory  can  be  associated 
with  each  processor,  either  on  the  micropro¬ 
cessor  chip  or  off-chip.  If  such  a  cache  has  a 
16  byte  line,  a  5%  miss  ratio  per  instruction, 
reads  6  bytes  per  instruction  and  writes  2 
bytes  per  instruction,  and  uses  write  through, 
then  the  memory  traffic  has  dropped  from  8 
bytes  per  instruction  to  2.8  bytes  per  instruc¬ 
tion  With  copy  back,  and  half  of  the  replaced 
lines  being  dirty,  the  memory  traffic  drops  to 
12  bytes  per  instruction,  an  85^  decrease 
Thus,  cache  memories  will  be  necessary  for 
high  performance  multi-microprocessor  com¬ 
puter  systems.  The  trend,  then,  will  be  for 


future  high  performance  microprocessors  to 
have  on-chip  or  outboard  cache  memories. 
(However,  it  is  possible  to  design  a  cache 
memory  in  such  a  way  that  memory  traffic 
actually  increases;  we  are  presuming  good 
design.) 

2.3.  On  Chip  Cache 

With  increases  in  circuit  density  and 
chip  area,  it  is  becoming  possible  to  place  use¬ 
ful  cache  memories  on  the  microprocessor 
chip.  These  caches  serve  two  purposes:  they 
significantly  decrease  memory  access  times  on 
hits,  and  they  also  reduce  the  bus  traffic. 
There  are  a  number  of  design  issues  regard¬ 
ing  on-chip  caches;  these  issues  have  been 
addressed  primarily  in  the  context  of  large, 
mainframe  sized  caches  and  need  to  be  recon¬ 
sidered  for  small  caches.  Further,  the 
tradeoffs  are  somewhat  different  for  on-chip 
caches,  since  transfer  times  dominate  latency 
times  for  misses,  in  contrast  to  larger 
machines,  and  main  memory  traffic  is  also 
important.  These  differences  affect  optimal 
choices  for  parameters  such  as  line  size  and 
cache  organization;  in  [Hill84]  some  of  these 
parameter  choices  are  considered  and  the 
sub-block  cache  organization  (sector  cache)  is 
analyzed  and  evaluated.  Also,  VLSI  permits 
’’cheap  and  easy”  associativity,  which  affects 
the  design  (fully  associative  vs.  set  associa¬ 
tive)  of  caches  and  TLBs. 

The  research  issues  are  ones  of  evaluat¬ 
ing  and  selecting  cache  design  parameters  in 
the  context  of  small  size  and  limited  off  chip 
bandwidth,  and  the  availability  of  "easy” 
associativity. 

2.4.  Off  Chip  Cache 

Of  no  less  interest  are  off-chip  caches. 
Microprocessors  recently  announced  and/or 
under  development  are  sufficiently  powerful 
that  they  can  benefit  from,  and  in  fact  need, 
more  cache  than  current  technology  permits 
to  be  on-chip.  A  single  cache  chip  could  be 
expected  to  hold  4Kbytes  or  more  of  cache,  as 
with  the  Fairchild  CLIPPER  [Cho86],  and  a 
cache  board  could  easily  contain  upwards  of 
32Kbytes  of  cache.  Research  issues  here  are 
the  same  as  apply  to  cache  memories  in  gen¬ 
eral,  including  virtual  vs.  real  address  caches, 
data/instruction  caches,  multilevel  caches, 
line  size,  cache  size,  cache  organization  and 
associativity,  main  memory  update  algorithm, 


and  multicache  consistency.  Some  of  these 
are  further  discussed  below. 

2.5.  Multilevel  Cache 

There  are  trends  in  computer  architec¬ 
ture  suggesting  the  further  development  and 
use  of  multilevel  cache.  On-chip  or  on-board 
caches  may  he  too  small  to  be  fully  effective, 
but  are  much  faster  than  more  remote  caches. 
Similarly,  it  may  be  cost  effective  to  use 
different  consistency  methods  for  each  cache 
level,  with  different  cost  and  performance 
tradeoffs.  Fujitsu,  in  its  Facom  382  [Fuji82, 
Hatt83]  uses  a  two  level  cache  and  reports 
that  such  a  design  has  advantages.  We 
believe  that  multilevel  caches  will  become 
more  common. 

2.6.  Virtual  Address  Caches 

Most  cache  memories  are  addressed 
using  real  addresses;  see  [Smit82]  for  a  dis¬ 
cussion.  There  are  performance  advantages, 
however,  for  caches  addressed  using  virtual 
addresses,  as  is  done  in  the  Amdahl  580 
[Amda82].  We  predict  greater  use  of  this 
design,  especially  in  new  designs  and  archi¬ 
tectures  where  the  synonym  problem  can  be 
eliminated  or  avoided.  (The  synonym  problem 
occurs  when  two  different  virtual  addresses 
map  into  the  same  real  address.) 

2.7.  Data  and  Instruction  Caches 

Most  current  cache  designs  use  a  single 
cache  which  serves  for  both  instructions  and 
data.  The  advantage  to  splitting  the  cache 
into  instruction  and  data  halves  is  that  the 
bandwidth  is  doubled;  the  disadvantage  is 
that  if  instructions  can  be  modified,  then  new 
consistency  and  correctness  problems  arise. 
For  newer  machine  designs,  in  which  compati¬ 
bility  with  old,  self-modifying  software  is  not 
a  problem,  we  predict  that  split  caches  will 
become  increasingly  frequent. 

2.8.  Microcode  Caches 

In  one  special  case,  that  of  a  cache  for 
microcode,  the  workload  is  predictable  and 
static.  In  that  case,  the -cache  can  be  optim¬ 
ized  for  the  workload,  and  conversely,  the 
workload  (microcode  words)  can  be  specially 
modified  to  instruct  the  cache  with  respect  to 
fetching,  replacement  and  branching  The 
identification  and  parameterization  of  such 


optimizations  is  an  open  and  useful  research 
problem. 

2i.  Vector  Processors  and  Caches 

Vector  processors  rely  on  a  steady 
stream  of  data  to  drive  the  vector  unit.  Cache 
memories  can  be  a  problem  in  such  a  system, 
since  cache  misses  cause  the  vector  unit  to 
wait  until  the  main  memory  fetch  completes. 
This  can  affect  performance  and  can  also 
cause  increases  in  complexity  in  the  vector 
and  cache  control  logic.  (Page  faults  are  an 
even  worse  problem.'  The  proper  design  of  a 
cache  in  the  context  of  a  vector  processor  has 
not,  to  the  author’s  knowledge,  been 
addressed  in  the  research  literature  and  is  an 
interesting  problem. 

2.10.  Workload  and  Performance  Evalua¬ 
tion 

The  primary  technique  used  for  the  per¬ 
formance  evaluation  of  cache  memory  designs 
is  trace  driven  simulation.  There  has  been  a 
tendency,  however,  to  use  small  applications 
programs  for  these  traces,  and  then  to  find 
that  actual  cache  performance  is  significantly 
worse  than  predicted,  since  large  programs 
and  systems  programs  tend  to  dominate.  The 
problem  of  selecting  an  appropriate  workload 
is  very  important  to  any  cache  memory 
evaluation  or  research  effort  and  is  being 
addressed  in  [Smit85b], 

2.11.  Cache  Parameter  Evaluation 

Despite  the  large  number  of  papers  pub¬ 
lished  about  cache  memories,  there  is  a  shor¬ 
tage  of  hard  data  in  the  public  literature  that 
the  cache  designer  can  use.  For  example, 
what  is  the  quantitative  effect  of  varying  the 
line  size?  What  is  the  quantitative  effect  of 
changing  the  degree  of  associativity  in  a  set 
associative  memory?  The  former  has  been 
addressed  in  [Smit85c].  The  author  is 
engaged  in  studies  on  the  latter  with  a  gradu¬ 
ate  student.  The  relationship  of  miss  ratio  to 
cache  size  is  considered  in  [Smit85b,  c] 
Further  quantification  of  cache  design  param 
eters  and  tradeoffs  is  needed,  especially  for 
caches  in  a  multiprocessor  environment 

2.12.  And  So  What  Else  is  New? 

Cache  memories  have  existed  since  the 
late  I960’s  and  the  IBM  ,360  85  ;Lipt68i.  and 


yet  are  still  an  active  area  of  research  and  are 
an  important  aspect  of  computer  design;  the 
extensive  bibliography  in  [Smit86]  testifies  to 
this.  The  author  also  consults  widely  on  the 
subject  of  cache  memories  and  with  every  new 
design  and  system,  there  are  special  twists 
that  raise  new  issues.  We  expect  that  the 
activity  in  cache  memory  studies  will  con¬ 
tinue  for  several  more  years. 

3.  Main  Memory 

The  use  and  management  of  main 
memory  has  been  an  active  area  of  research 
since  the  early  to  mid  1960s;  see  [SmitTSd]  for 
a  large  bibliography  of  the  relevant  literature 
and  [Denn80]  for  an  overview  of  the  research. 
Traditionally,  the  research  issues  have 
stemmed  from  the  fact  that  up  until  the  late 
1970’s,  memory  was  expensive  and  thus  was  a 
critical  resource;  the  efficient  use  of  memory 
was  very  important.  Problems  relating  to 
paging,  such  as  replacement  algorithms  and 
control  of  the  degree  of  multiprogramming 
were  stressed.  These  problems  were  exacer¬ 
bated  by  the  many  computer  architectures 
with  too  few  bits  of  addressing,  e  g  16,  which 
also  limited  the  amount  of  memory  that  could 
be  usefully  used 

Within  the  last  few  years,  main  memory 
has  become  both  plentiful  and  cheap.  Thus, 
except  for  rare  cases,  memory  is  no  longer  a 
scarce  resource,  and  the  traditional  research 
problems  are  of  much  less  interest 

There  are  still  some  research  and 
development  problems  relating  to  main 
memory.  Probably  the  most  important  ques 
tion  is  how  to  design  a  cost  and  performance 
effective  shared  main  memory  For  example,  a 
crossbar,  such  as  was  used  in  C  mmp  WulfT'J' 
is  expensive,  somewhat  slow,  has  some  queue¬ 
ing  delays  [Bask76],  and  poses  reliability 
problems.  For  a  project  at  the  University  of 
Texas,  [Oppe83],  a  banyan  network  is 
planned  Many  of  the  various  multiprocessor 
research  and  development  projects  have  as 
their  central  issue  the  processor  a^d  memory 
interconnection  strategy  and  its  impact  on 
access  time,  cost,  reliability  and  queueing 
delays  This  is  and  is  likely  to  continue  to  he 
an  important  and  difficult  research  problem 

Another  important  question  which  has 
thus  far  been  left  almost  entirely  to  the 
memory  manufacturers  is  u  hat  ;«  rhi  Vm 


functional  design  for  a  high  density  memory 
chip?  Should  the  data  come  out  1  bit  wide,  4 
bits  wide,  8  bits,  or  more?  If  it  is  desired  to 
read  a  word  at  a  time,  and  a  word  is  32  bits, 
then  32  chips  one  bit  wide  are  required;  if 
these  are  256Kbit  chips,  then  32  chips  yield  a 
megabyte,  which  is  too  much  for  some  appli¬ 
cations.  It  has  been  remarked  to  this  author 
that  current  chip  designs  are  very  poor  for 
computer  designers;  memory  chips  should  be 
designed  to  latch  the  inputs  and  outputs 
[Koto84].  Now  the  external  logic  to  perform 
this  function  is  expensive  and  it  has  a  nega¬ 
tive  impact  on  performance.  Another  ques¬ 
tion  is  whether  "nibble  mode”  (by  which  addi¬ 
tional  sequential  bits  can  be  obtained  in  much 
less  time  than  the  first  bit  referenced)  is  use¬ 
ful  and  represents  a  good  design  choice. 

Hie  function  of  replacement  in  extremely 
large  main  memories  has  also  been  described 
to  the  author  as  a  problem.  The  clock 
replacement  algorithm  [Smit78c]  is  easy  to 
implement  given  only  a  reference  bit,  but  the 
number  of  page  frames  that  have  to  be  exam¬ 
ined  becomes  unreasonably  large  in  very 
large  memories.  The  use  of  set  associative 
replacement  for  main  memories  has  been  pre¬ 
viously  suggested  and  analyzed  [Smit78a]  and 
was  found  to  be  quite  effective,  the  use  of  set 
associative  clock  replacement  should  provide 
an  effective  replacement  algorithm  yielding 
adequate  performance  at  low  overhead  and 
needing  no  additional  hardware 

Another  aspect  of  the  replacement  prob¬ 
lem  is  to  find  a  way  to  have  a  consistent  refer¬ 
ence  hit,  when  there  are  multiple  cache 
memories  in  the  system,  and  the  reference  bit 
is  maintained  locally 

A  number  of  factors  suggest  that  optimal 
page  sues  will  increase  over  the  next  decade 
Large  pages  load  and  transfer  information 
with  fewer  page  faults,  if  memory  is  plentiful 
TLBs  <of  a  constant  number  of  entnesi  can 
address  more  memory  as  page  size  increases 
The  number  of  sets  can  be  increased  in  real 
address  caches  as  page  size  increases  since 
additional  bits  are  available  for  addressing 
without  translation  As  transmission  speeds 
to  and  from  I O  devices  increase  the  delay 
due  to  large  transfers  will  decrease  relative  to 
latency  and  will  become  less  significant 
Right  now  4K  pages  are  a  reasonable  choice 
M2  bytes  as  is  used  on  some  smaller  com  pul 
ers  is  clearly  much  too  small  We  believe  that 


over  the  next  decade,  page  sizes  of  8K  or  16K 
will  become  desirable. 

4.  Gap  Filler  and  Disk  Cache 

The  management  of  main  memory  has 
in  the  past  been  an  interesting  research  prob¬ 
lem  because  the  large  gap  in  access  times 
between  main  memory  t<l  microsecond'  and 
disk  storage  (10-100ms)  meant  that  transfers, 
especially  due  to  page  faults,  often  caused 
substantial  CPU  idle  time.  This  large  access 
gap  still  exists,  and  is  likely  to  get  worse, 
since  CPUs  and  main  memory  continue  to  get 
faster,  disk  performance  has  improved  very 
little  recently  and  is  not  likely  to  improve 
overall  by  a  significant  amount  [Hoag79].  The 
impact  of  the  access  gap  still  exists,  but  its 
focus  has  shifted  more  towards  explicit  I/O. 
Essentially,  the  problem  is  that  with  the 
increasing  density  of  disks,  their  nonincreas¬ 
ing  performance  and  the  9lowly  or  nonincreas¬ 
ing  number  of  data  paths  to  disk,  the  disk  sys¬ 
tem  will  be  unable  to  provide  sufficient  I/O 
bandwidth  to  serve  high  performance  CPUs 
and  multiprocessor  systems.  This  problem  is 
not  yet  major,  but  is  expected  to  become  worse 
in  the  next  few  years. 

The  existence  of  a  "gap  filler  technol¬ 
ogy",  i.e  one  intermediate  in  performance  and 
coat  between  main  memory  and  disk,  would 
possibly  provide  a  solution  to  the  performance 
bottleneck  projected  above  There  is.  how 
ever,  no  such  technology  available  Although 
a  few  years  ago  CCDs,  magnetic  bubbles  and 
EBAM  'electron  beam  accessed  memory'  were 
considered  promising,  none  is  viable  for  a  gap 
filler  role  at  this  time  nor  is  likely  to  be  soon 
•See  e  g  'Spec84!  > 

It  is  possible  to  make  the  useful  observa 
tion  that  there  is  significant  locality  in  I  () 
reference  patterns  data  is  both  referenced 
sequentially  and  for  some  data  sets  there  is 
significant  reuse,  see  SmitH.Sd  SmitT.o 
SmitTfi,  SmitTHb  for  data  supporting  thin 
observation  Thus  it  would  make  sense  to 
cache  portion-  of  the  disk  address  space  in 
mam  memory  or  outboard  in  an  Mt  >S  H  AM 
based  disk  cache  The  effectiveness  .it  thi- 
idea  is  shown  in  Smit*Od  There  are  already 
several  disk  cache  produits  including  the 
Sybercache  bv  N  I  <  '■'tor-'..’  and  the  IHM 

t HMD  mode  I  <  .  it  and  2  1  storage  controller- 
for  both  companies  prod  oil-  the  on  hr  i»  out 
hoard  at  the  storage  ontmller  NT  i  ha-  a 


disk  cache  which  is  associated  with  although 
outboard  to  the  CPU  [Toku80]. 

The  design  of  disk  caches  has  been 
neglected  in  the  research  literature  (except  for 
[Smit85d])  and  is  a  fruitful  area  for  study. 
Questions  to  be  answered  include:  what  is  the 
best  location  in  the  system  for  a  cache?  How 
large  should  the  cache  be  for  good  perfor¬ 
mance?  What  algorithms  should  be  used  for 
fetch  and  replacement?  How  large  should  the 
blocks  be9  Should  the  design  be  write- 
through  or  copy-back,  and  what  are  the  per¬ 
formance  implications  of  each0  The  answers 
to  these  questions  are  quite  sensitive  to  the 
workload  and  additional  workload  data  needs 
to  be  gathered. 

The  disk  caching  problem  also  extends  to 
distributed  systems.  A  recent  trend  in  com¬ 
puter  systems  is  toward  a  number  of  proces¬ 
sors,  including  personal  computers,  worksta¬ 
tions,  and  mainframes  linked  together  with  a 
local  area  network,  and  backed  by  a  file 
server.  In  such  a  system,  the  file  server  and 
the  network  are  limited  resources,  both  sub¬ 
ject  to  congestion,  and  the  overheads  of 
remote  I'O  are  substantial.  Thus,  there  are 
benefits  to  be  gained  in  caching  at  the  proces¬ 
sors,  either  on  local  disk  or  in  RAM,  and  also 
perhaps  to  caching  in  RAM  as  well  at  the  file 
server  itself,  to  avoid  unnecessary  physical 
disk  I/Os.  The  research  problems  of  caching  in 
a  distributed  system  include  not  only  those 
mentioned  above,  but  the  following:  should 
caching  occur  at  the  processor,  at  the  file 
server  or  both0  What  policies  and  parameters 
are  appropriate  at  each,  and  are  the  same  or 
different  choices  appropriate0  How  can  one 
maintain  consistency  in  such  a  system  when 
multiple  copies  of  the  same  data  can  exist0 

We  believe  that  the  use  of  disk  cache 
will  be  both  more  important  and  more  com¬ 
mon  in  the  future  The  number  of  interesting 
research  problems  in  disk  cache  will  also 
stimulate  related  research  activity 

5.  Disks 

Disk  technology  is  evolving  slowly  and 
has  been  for  some  time  [Hoag79]  Disk  den 
sity  is  increasing  at  about  200  per  year,  i  e  it 
doubles  every  3-4  years  Conversely,  disk 
arrest  times  are  improving  very  little  if  at  all , 
with  rotation  times  remaining  essentially  con¬ 
stant  and  arm  seek  times  dropping  at  a  slow 


rate.  These  trends  should  continue  at  com¬ 
parable  rates  over  much  of  the  next  decade. 

It  can  be  expected,  however,  that  disk 
I/O  transfer  rates  will  increase  significantly. 
Currently,  they  reflect  the  linear  density  of 
bits  on  the  disk  surface  and  also  the  data 
transmission  technology  between  disk  and  the 
CPU.  There  are  three  reasons  to  expect  this 
rate  to  increase:  ( a  i  Increasing  physical  bit 
density  on  the  disk  surface  will  increase  the 
rate  of  reading  and  writing  >b>  Technologies 
such  as  optical  fibers  are  becoming  available 
and  cost  effective  as  a  way  to  achieve  high  I  0 
rates.  < c)  Performance  considerations  suggest 
that  higher  transfer  rates  will  help  alleviate 
the  bottleneck  discussed  in  the  section  above 
on  gap  fillers. 

There  is  a  very  rapid  rate  of  change  and 
improvement  at  the  "low  end"  for  hard  disks. 
High  density  small  disks  are  rapidly  being 
incorporated  into  high  end  personal  comput¬ 
ers,  workstations  and  small  shared  computers. 
Performance  and  density  will  continue  to 
increase  and  cost  will  continue  drop.  Local 
disks  are  becoming  an  important  part  of  the 
memory  hierarchy  in  a  distributed  system  and 
will  become  more  so. 

Another  likely  change  in  disk  system 
design  is  to  associate  more  electronics  with 
each  disk.  These  electronics  will  act  to  correct 
errors,  buffer  tracks,  cache  information  and/or 
buffer  I/O  data  streams  so  as  to  mask  physical 
latency  and  decouple  transmission  from  physi¬ 
cal  disk  position  and  rotation  speed  This 
change  is  long  overdue  and  should  begin  to 
occur  at  any  time. 

There  are  a  few  areas  of  research  in  disk 
systems.  The  most  promising  is  the  general 
area  of  HO  optimization,  which  is  surveyed  in 
[SmitSlb];  see  also  [Smit81d],  With  changes 
in  technology,  optimal  solutions  to  issues  such 
as  block  size,  angular  placement  of  blocks, 
disk  loading,  etc  change  For  a  recent  imple¬ 
mentation  of  many  disk  optimizations,  see 
[Mcku84]  Another  research  problem  is  to 
determine  the  most  promising  uses  of  elec¬ 
tronics  in  the  disk  system. 

6.  Mass  Storage  and  Tape 

We  define  mass  storage  to  be  any 
storage  system  with  significantly  larger  capa¬ 
city  than  disk,  and  it  is  usually  cheaper  per 
byte  stored  and  slower  than  disk.  Included  in 


this  category  are  tape  and  tape  subsystems 
(such  as  the  IBM  3850,  the  Ampex  Terabit 
memory  and  the  Calcomp  automated  tape 
library)  and  optical  disk.  The  technology 
currently  undergoing  the  most  rapid  develop¬ 
ment  is  optical  disk.  We  believe  that  over  the 
near  term,  optical  storage  technology  will 
advance  the  most  rapidly,  and  within  5  or  10 
years,  very  large  optical  archival  stores  will 
become  common  in  large  computer  systems. 
No  other  technology  seems  to  be  competitive. 

Mass  storage  can  be  and  is  used  for 
explicit  I/O,  whereby  one  issues  read  and 
write  commands  to  specific  devices  and  data 
volumes.  The  most  promising  use  for  mass 
storage,  however,  is  as  an  automatic  backing 
store  for  disk.  In  that  circumstance,  the  disk 
address  space  would  be  expanded,  much  like 
virtual  memory  is  used  to  expand  main 
memory  addressing,  and  automatic  migration 
algorithms  would  be  used  to  move  data 
between  mass  storage  and  disk  as  needed. 

There  are  numerous  research  problems 
relating  to  the  efficient  and  effective  use  of 
mass  store  as  an  automatic  backing  store  for 
disk.  Questions  such  as  when  to  fetch  data, 
how  much  to  fetch,  when  to  remove  data  on 
disk  and  migrate  it  back  to  mass  store,  when 
to  compact  mass  storage  volumes  and  which 
volume  to  transfer  it  to  are  all  interesting  and 
significant  research  problems.  Some  relevant 
work  on  file  replacement  algorithms  and  file 
reference  patterns  appears  in  [Smit81a]  and 
[Smit81c];  see  also  [Lawr82]  for  more  work  in 
the  same  area. 

7.  Logical  View  of  Memory  Hierarchies 

In  addition  to  the  problems  discussed 
above,  which  are  primarily  concerned  with 
the  physical  design  and  algorithmic  use  of 
memory  hierarchies,  there  are  some  interest¬ 
ing  issues  having  to  do  with  the  logical  view 
of  the  memory  system . 

The  most  important  issue  is  one  of  how 
large  an  address  space  is  needed.  It  has  only 
been  recently  that  there  has  been  a  general 
realization  that  16  bits  of  addressing  are  not 
enough,  and  the  newest  generation  of 
microprocessors  use  32  bits  of  address.  Like¬ 
wise,  the  IBM  System  370  architecture  has 
been  extended  in  the  308x  series  machines 
[IBM82]  to  use  31  bits  of  addressing.  31  or  32 
bits  (4  gigabytes!  should  be  sufficient  through 


1990  or  1995  for  a  single  processor  system, 
but  one  can  expect  to  need  more  addressable 
memory  for  a  uniprocessor  sometime  in  the 
1990’s,  and  earlier  for  a  shared  memory  mul¬ 
tiprocessor  system.  We  expect,  therefore,  to 
see  architectural  changes  and  extensions  to 
permit  this  over  the  next  ten  years.  (Anyone 
designing  a  computer  at  this  time  would  be 
wise  to  keep  this  in  mind.  S/he  should  also 
avoid  partitioning  the  address  space  via  high 
order  bits.' 

The  idea  of  object  based  architectures 
and  capability  based  addressing  was  a  popular 
one  in  the  1970s,  and  machines  such  as  the 
Intel  432,  the  Cambridge  CAP  machine  and 
the  IBM  System/38  all  embody  and  use  such 
concepts.  Despite  the  advantages  in  program¬ 
ming  productivity  and  data  security  achieved 
by  that  approach,  the  current  belief  is  that 
there  are  inherent  performance  penalties  to 
such  architectures,  which  suggests  instead  the 
use  of  simple  load/store  (reduced  instruction 
set  -like)  architectures  with  simple  address¬ 
ing.  This  issue  is  still  an  appropriate  one  for 
research  and  it  may  yet  be  possible  to  use  a 
sophisticated  logical  addressing  scheme  and 
yet  have  good  performance. 

Protection  in  most  computer  systems  is 
primarily  associated  with  memory  and  the 
memory  hierarchy.  Protection  bits  are  usu¬ 
ally  associated  with  pages,  segments  and'or 
files.  Existing  protection  systems  tend  to  be 
unable  to  stand  up  under  sophisticated  pene¬ 
tration  efforts,  and  further  research  in  this 
direction  may  be  warranted. 

It  has  long  been  a  goal  of  computer 
designers  and  users  to  have  a  one  level  store, 
in  which  all  stored  information  would  be 
addressable  within  the  same  address  space. 
Virtual  memory  is  a  step  in  that  direction, 
but  does  not  usually  include  the  file  system. 
We  expect  some  additional  research  and  possi¬ 
bly  minor  commercial  moves  in  the  direction 
of  a  one  level  store,  but  over  the  near  term, 
results  are  not  likely  to  be  substantial. 

8.  Conclusions 

In  this  paper,  we  have  reviewed  memory 
hierarchies,  and  have  addressed  two  particu¬ 
lar  points:  (a'  what  are  the  likely  directions 
for  the  development  of  memory  hierarchies, 
and  'b!  what  are  the  interesting  research 
problems  As  was  explained  in  the  introduc- 


tion,  the  memory  hierarchy  is  one  of  the  or 
perhaps  the  most  important  part  of  the  com¬ 
puter  system  with  respect  to  both  performance 
and  utility.  We  therefore  believe  that  with 
respect  to  both  research  and  development, 
memory  hierarchies  will  be  a  central  area  of 
focus  over  the  next  decade. 
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Introduction 


Over  the  past  3  years,  the  U.S.  Army  Stra¬ 
tegic  Defense  Coomand  has  investigated  advanced 
signal  and  daca  processing  technology  for  obtain¬ 
ing  information  with  complex  infrared  (IR)  sen¬ 
sors.  The  results  of  these  investigations  have 
culminated  in  the  development  of  a  very  sophisti¬ 
cated  hardware-in-the-loop  (HWIL)  signal  and  data 
processing  testbed.  This  testbed,  shown  in  Figure 
1  below,  is  located  at  the  Army's  Advanced 
Research  Center  in  Huntsville,  Alabama.  This 
testbed  facility  is  now  being  used  to  conduct 
research  in  distributed  data  processing  architec¬ 
tures,  algorithms,  and  real-time  applications 
software  for  a  wide  range  of  missions. 

The  signal  and  data  processing  testbed  con¬ 
sists  of  a  System,  Environment,  and  Threat  Simula¬ 
tion  (SETS)  and  an  Algorithm  Testbed  Simulation, 
both  of  which  run  on  an  array  of  ten  VAX  11/780 
computers.  This  nonreal-time  code  encompasses  all 
algorithm  performance  enhancements  without  regard 
to  real-time  constraints  and  serves  as  a  standard 
by  which  the  real-time  algorithms  are  compared. 


These  simulations  were  researched  and  established 
by  Nichols  Research  Corporation  in  Huntsville, 
ALabama . 

The  Mosaic  Sensor  Emulator  Unit  (MSEU)  emu¬ 
lates  a  sensor  module  containing  1,920  detectors 
and  outputs  real-time  waveforms  from  each  detector 
for  1,000  targets  for  20  scans  (20,000  targets  for 
1  scan).  The  Auxiliary  Sensor  Signal  Processing 
Unit  (ASSPU)  performs  time-dependent  processing 
(1,700  Million  Instructions  Per  Second  (MIPS)), 
daca  partitioning  processing  (556  MIPS),  and 
object-dependent  processing  (64  MIPS)  before 
presenting  three-color  information  to  the  data 
processor.  These  systems  were  designed  and  built 
by  Boeing  Aerospace  Company  in  Seattle,  Wash¬ 
ington.  The  Advanced  Distributed  Onboard  Pro¬ 
cessor  (ADOP),  a  1 5-MI  PS  system,  then  does  the 
measurement  processing,  color-to-color  correla¬ 
tion,  scan-to-scan  correlation,  precision  track, 
discrimination,  and  threat  reporting.  This  system 
was  designed  and  built  by  Honeywell,  Inc.,  in 
Clearwacer,  Florida.  The  real-time  application 
code  and  part  of  the  operating  system  were 
designed  and  implemented  by  TRW  in  Huntsville, 
Alabama. 
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Hardware  Architecture  Descriptions 


The  MSEU  and  ASSPU  systems  are  expertly 
described  in  the  Wednesday,  November  6,  evening 
classified  session  paper  entitled  "Electronic 
Emulator  and  Signal  Processor  for  BMD  Mosaic  IR 
Sensor"  by  Norsworthy  and  Franzel  and  will  not  be 
repeated  here. 

The  ADOP  comprises  five  processing  elements 
(nodes)  that  communicate  on  three,  1-Mword/sec 
independent  global  buses.  Each  node  contains 
three  Central  Processing  Units  (CPUs),  three 
Floating  Point  Processors  (FPPs),  one  million  (1M) 
words  of  17-bit  memory,  and  a  global  Bus  Interface 
Controller  (8IC). 

The  ADOP  node  architecture,  shown  in  Figure 
2,  comprises  four  basic  functions: 

(1)  A  fixed  point,  MIL-STD-1750A,  CPU 
capable  of  executing  the  fixed  point  Digital 
Avionics  Instrumentation  System  (DAIS)  instruction 
mix  at  1  MIPS.  The  CPU  implements  a  pipelined 
instruction  queue  and  embedded  memory  management 
unit  function  to  achieve  this  throughput  while 
addressing  up  to  1M  words  of  memory.  The  basic 
CPU  microcycle  time  is  350  nsec. 

(2)  A  FPP  serves  as  an  adjunct  to  the  CPU  to 
facilitate  achievement  of  1  MIPS  full  DAIS 
floating  point  operation.  Using  the  built-in 


function  generator  option  of  MIL-STD-1750A, 
special  macro  instructions  can  be  coded  to  perform 
special  operations  for  signal  processing  opera¬ 
tions.  The  FPP  contains  a  hardware  multiplier  and 
an  ALU,  which  essentially  "shadows”  the  CPU 
arithmetic  logic  unit. 

(3)  A  four-port  local  memory  organized  as 
256K  words  by  17  bits  (1  parity  bit).  A  read- 
modify-write  function  facilitates  semaphore 
operations  required  by  the  operating  system.  The 
system  of  node-local  buses  provides  access  to  all 
of  memory  by  each  CPU. 

(4)  A  BIC,  when  coupled  with  the  real-time 
operating  system,  efficiently  handles  all  global 
communications.  The  BIC  accesses  message  queues 
in  node-local  memory  and  moves  messages  on  the 
system  of  Global  buses.  A  logical  addressing 
scheme  allows  the  routing  to  be  configured  by  the 
operating  system  without  affecting  application 
software. 

The  ADOP  node  structure  allows  virtually 
contention-free  processing  of  an  application 
program  when  data  sets  are  partitioned 
nonoverlapping.  To  this  end,  a  complete  set  of 
software  tools  has  been  developed,  including  a 
PASCAL  compiler  and  a  distributed  operat.ng 
system.  The  use  of  these  tools  provides  a  strict 
adherence  co  programming  standards  snd  com.  i.  ra¬ 
tion  controls  thac  make  testing  on  the  mu . 1 : -CPU . 
multi-node  architecture  an  achievable  tas*. 
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Ttdtbcd  Utilization 

The  testbed,  in  whole  and  in  part,  provides  a 
facility  for  several  SOI  research  endeavors, 
including  the  Airborne  Optical  Adjunct  ( AOA ) ,  the 
Airborne  Optical  Sensor  (AOS),  nonnuclear  inter¬ 
ceptor  avionics,  and  red-blue  situations.  One  of 
the  first  research  efforts  on  the  testbed  was 
conducted  by  Teledyne  Brown  Engineering  (TBE), 
Huntsville,  Alabama,  using  the  currently  imple¬ 
mented  algorithm  set  bated  on  an  optical  probe. 
The  TBE  research  team  studied  the  sensitivity  of 
the  software/hardware  performance  to  varying 
threat  characteristics  such  as  the  number  of 
reentry  vehicles,  angular  rates,  object  spacing, 
and  "clumped"  objects.  These  researchers  are  now 
studying  other  proposed  algorithms,  as  well  as  new 
re ;  ormu 1  a 1 1 ons  of  the  existing  ones,  for  improved 
perf  ormance . 


SETS;  however,  before  it  can  be  used  to  drive  the 
MSEU,  it  must  be  converted  into  instructions  that 
the  emulator  can  use  for  generation  of  the 
appropriate  sensors  signals. 

A  series  of  sophisticated  software  programs 
are  used  to  transform  the  basic  threat  data  into 
specific  optical  image  data  for  each  detector  in 
the  sensor  focal  plane  array.  These  data  form  the 
basis  for  creation  of  the  MSEU  instruction  set. 
The  MSEU,  in  turn,  generates  the  signals  that  are 
used  as  input  to  the  ASSPU.  These  signals  include 
detector,  background,  and  gamma  noise  to  further 
stress  the  ASSPU.  In  both  the  SETS  and  the  MSEU- 
ASSPU,  the  threat  data  is  processed  so  chat  the 
output  from  each  is  in  a  common  format  and  forms 
the  input  data  for  either  the  data  processing 
algorithm  simulation,  the  reformulated  algorithms, 
or  the  ADOP. 


Development  of  threat  scenario  data  is  the 
first  step  in  the  utilization  of  the  testbed.  Ac 
the  present  time,  all  of  the  threat  data  ucilized 
was  developed  by  TBE  as  part  of  the  forward 
Acquisition  System  Integrated  Cround  Test  (FAS 
ICT)  Program.  These  data  characterize  the  threat 
m  terms  of  the  number  of  reentry  vehicles, 
Jec.jvs,  tragments,  and  ocher  objects;  the 
spacing*;  angular  rates;  irradiance  parameters; 
etc.  The  scenario  must  be  designed  in  such  a  way 
•"ar  it  accomp.iahes  two  purposes.  First,  it  must 
be  rred.ble  in  terms  it  its  representation  of  a 
po'ent.al  ballistic  missile  attack  based  on 

available  intelligence  data.  Second,  it  must  be 
designed  to  provide  the  stressing  criteria 
required  to  e-.a.uate  the  hardware  ■  so  f '  ware  under 
ana.vsis.  A  general  scenario  as  it  appears  in  the 
•ie.d  ot  y.ew  ( FOV )  of  e  typical  scanning  mosaic 
le-i.r  is  shown  in  Figure  ),  with  the  focal  plane 
i"c<  ;•  er-ed  >nt  ->  the  FOV  for  clantv.  The 
1  "  •  •*  i  ’  la’s  ne-.  me  s  'he  "truth”  .pon  wh.  .:h 
. e - ■  rma" if  measj'es  tor  the  hardware  and  sottware 
»r*  te-e-m.-ed  j'lc’  ;c»,  a  subset  ot  the  FAS 

11*  '-'»a'  'ha'  nee's  the  previously  defined 

'  '.•  a  s  se.ec'ed  ?h  s  subset  is  'hen  used  as 

'•e  pr.nary  oa'a  input  tor  be  t  n  the  5E“"S  and  the 
wsr  . 
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The  algorithm  simulation  contains  a  full 
complement  of  maximum  performance  algorithms  Chat 
run  in  nonreal  time  on  the  VAX  11/780  neework. 
The  output  from  this  simulation  is  Chen  compared 
to  the  "truth"  data  to  establish  a  new  standard  to 
compare  with  the  real-time  algorithms  or  actual 
performance.  The  output  of  the  SETS  or  signal 
processor  is  next  used  to  drive  the  reformulated 
algorithm  set,  again  in  nonreal  time.  This 
reformulated  algorithm  set  is  the  set  of 
algorithms  that  have  been  designed  (reformulated) 
t  rom  the  maximum  performance  algorithms  to  run  in 
real  time  when  mapped  onto  the  ADOP  hardware.  The 
results  derived  from  the  reformulated  runs  are 
used  to  predict  the  performance  of  these 
algorithms  when  implemented  on  the  ADOP. 

The  final  step  is  to  use  the  output  from  the 
SETS  and/or  Che  signal  processor  co  exercise  the 
real-time  algorithms  on  the  ADOP  to  determine  the 
actual  performance.  This  cestbed  configuration, 
when  used  in  the  manner  just  described,  can 
support  several  important  research  projects  for 
the  Strategic  Defense  Initiative  (SDI)  Program.  A 
typical  output  from  che  system  is  depicted  in 
Figure  4,  which  shows  the  "truth"  performance,  the 
algorithm  simulation  performance,  and  the  ADOP 
pe  r  f  o  rmance . 
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< ure  A.  Scan-To-Scan  Performance. 


The  Dace  Processing  Directorate,  within  the 
U.S.  Army  Strategic  Defense  Command,  working  in 
concert  with  Teledyne  Brown  Engineering, 
Honeywell,  Boeing,  TRW,  and  MRC  has  developed  an 
advanced  long  wave  infrared  (LWIR)  signal  and  data 
processing  testbed  for  studying  and  characterizing 
performance  for  various  BMD  missions.  A  full  set 


of  LWIR  sensor  signal  and  data  processing 
algorithms  has  been  developed  ior  a  probe-type 
sensor  and  many  experimental  results  have  been 
obtained.  In  addition,  a  complete  complement  of 
support  software  has  been  developed  for  this  HWIL 
facility.  Analysts  are  currently  conducting 
research  studies  on  several  key  technology  issues 
for  the  SDI  Program. 
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On  Parallelism  in  Software/Hardware  Design  Tools 

P.A.Subrahm&ny&m 
AT&T  Bell  Laboratories 

Abatrset.  This  paper  explores  opportunities  for  parallelism  in  some  classes  of  computer  aided 
design  tools,  and  attempts  to  convey  a  few  insights  about  how  one  might  go  about  exploiting 
this  parallelism.  More  specifically,  perspectives  that  bear  on  the  design  of  hardware  accelerators 
for  some  important  classes  of  design  tools  are  provided;  however,  the  actual  details  of  such 
architectures  are  not  specified. 

Keyword s:  Parallel  architectures,  Design  aids,  integrated  system  design,  reusable  software. 


1.  Introduction 

It  is  widely  acknowledged  that  the  design  of  software/hardware  is  complex,  error  prone,  and 
therefore  costly.  Two  broad  classes  of  attempts  to  address  this  problem  include  various  flavors 
of  design  methodologies  and  computer  aided  design  tools.  In  this  context,  a  "design 
environment"  usually  consists  of  some  collection  of  tools  supporting  some  combination  of 
methodologies  for  the  tasks  concerned.  While  significant  progress  in  design  environments  has 
clearly  been  made  in  the  last  few  years,  there  is  still  an  urgent  need  to  make  such  design 
environments  conceptually  cleaner  and  faster.  A  naive  way  to  improve  existing  execution  speeds 
is  to  build  special  purpose  hardware  for  each  tool.  This  seems  impractical;  a  major  problem  lies 
in  fact  that  there  are  diverse  concepts,  terminology,  and  tools  involved  in  a  sophisticated  design 
environment,  and  there  is  a  tremendous  cost  overhead  associated  with  designing  special  purpose 
hardware. 

An  alternative  suggested  here  is  that  based  on  some  notions  of  what  is  conceptually  cleaner,  and 
that  are  in  consonance  with  emerging  trends  in  formally  based  approaches  to  specification, 
programming  languages  and  design  paradigms,  we  can  partition  a  design  environment  in  a 
manner  that  potentially 

•  enables  a  better  ammortization  of  development  costs  for  (a  class  of)  specialized  HW  tools, 
and 

•  supports  parallelism. 

The  rest  of  this  paper  elaborates  mainly  on  these  issues.  In  the  next  section,  we  summarize 
some  of  the  major  activities  involved  in  the  design  process,  suggest  paradigms  for  carrying  out 
these  activities,  and  then  focus  on  the  tools  needed  to  support  these  paradigms.  The  objective 
here  is  to  highlight  the  fact  a  surprisingly  large  number  of  activities,  although  not  all,  can 
benefit  from  a  common  set  of  tools  (provided  that  adequate  attention  is  paid  to  the  global 
architecture  of  such  design  aids).  The  importance  of  this  observation  is  that  it  suggests  a 
partitioning  of  the  tools  in  a  design  environment  that  enables  a  much  better  ammortization  of 
hardware  development  costs  for  special  purpose  engines.  The  structure  of  the  resulting  design 
environment  is  then  examined,  with  a  view  to  exploring  the  potential  for  parallelism  in  its 
execution.  We  classify  the  various  forms  of  parallelism  possible,  and  indicate  what  flavors  of 
hardware  architectures  may  be  suited  to  support  the  execution  of  advanced  design  tools  of  this 
nature. 

A  subliminal  assumption  made  here  is  that  formally  based  approaches  to  design  -  both 
methodologies  and  tools  -  have  the  advantage  of  enabling  provably  correct  designs,  and 
encourage  the  use  of  rigorous  specifications  at  a  high  level  of  abstraction.  In  turn,  this  enables 
rapid  prototyping  of  the  specifications,  and  tends  to  reduce  overall  design  and  maintenance 
costs.  Furthermore,  the  software  tools  used  in  such  contexts  are  arguably  conceptually  clearer 


H 


and  more  robust:  They  are  typically  based  on  a  standard  infrastructure,  i.e.,  one  where  the 
domain  has  been  well  studied  (such  as  formal  syntax  and  semantics  of  languages,  algebra  or 
logic);  and  tools  (such  as  grammar-driven  editors  and  symbol  manipulation  systems)  have 
evolved  that  support  typical  operations  in  these  domains. 

The  overall  cost-effectiveness  and  robustness  of  design  environments  that  employ  such  tools  can 
be  greatly  improved  if  components  of  a  design  aid  that  are  aimed  at  different  activities  can  use 
a  common  pool  of  tools.  Additionally,  the  potential  advantages  are  further  enhanced  if  there  is 
some  parallelism  intrinsic  to  such  systems  that  is  amenable  to  intelligent  exploitation, 
particularly  via  hardware  accelerators.  The  main  motivation  for  the  research  discussed  in  this 
paper  is  to  explore  design  tools  that  possess  such  characteristics. 
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Figure  1.  Design  Activities,  Paradigms,  and  Tools 


2.  Design:  Activities,  Paradigms,  and  Tools 

Figure  I  indicates  some  of  the  major  activities  involved  in  the  design  process,  suggest  paradigms 
for  carrying  out  these  activities,  and  then  focus  on  the  tools  needed  to  support  these  paradigms. 
This  figure  is  designed  not  to  be  exhaustive,  but  merely  to  convey  a  flavor  of  some  of  the  major 
activities  involved  in  integrated  system  design.  Our  main  intent  here  is  to  establish  a  context 


for  wha*  follows,  rather  than  to  discuss  involved  in  great  depth 

The  activities  indicated  in  Figure  1  include  problem  specification,  rapid  prototyping  of 
specifications,  program  or  hardware  generation,  and  maintenance  Problem  specification 
involves  an  explicit  statement  of  the  problem  in  a  programming  language  While  this  can  m 
principle  be  any  programming  language,  there  are  cogent  arguments  as  to  why  this  should  be  a 
high  level  language  with  a  declarative  flavor,  and  more  specifically  one  that  minimi  s  the 
amount  of  representation  dependent  information  that  must  be  specified  l’3|  In  view  of  this,  the 
mode  of  specification  depicted  suggests  the  use  of  functional,  equational  or  logic  based 
techniques. (4).  Good  data  structuring  and  knowledge  representation  capabilities  are  also 
needed.  These  are  being  supported  in  the  context  of  evolving  systems  in  the  context  of  logir 
and  relational  systems{2). 

To  help  "debug"  the  specifications,  facilities  for  rapid  prototyping  are  needed  This  implies  the 
need  for  support  tools  like  syntax  and  semantics  based  editors,  interpreters  and  symbolic 
execution  facilities  Such  tools  are  increasingly  driven  by  the  syntax  and  semantics  of  the 
language  grammar  as  opposed  to  being  custom  coded. 

The  next  phase  usually  involves  the  generation  of  either  software  or  hardware  that  consistently 
implements  these  specifications.  This  may  be  done  using  traditional  compiler  techniques,  or  a 
somewhat  generalized  form  of  compilation  that  may  be  conceptualized  as  transformation  based 
techniques.  The  actual  generation  of  hardware  designs  involves  other  lower  lower  level  activities 
such  as  symbolic  IC  design,  cell  layout,  simulation  at  various  levels,  a  posteriori  verification, 
testing  etc.  In  both  cases,  the  tools  needed  include  term  (or  tree)  manipulation  sysirms. 
semantics  driven  tools.  Source-to-source  transformations  additionally  need  various  forms  of 
formal  manipulation  or  term  rewriting  systems,  along  with  limited  kinds  of  theorem  proving 
abilities.  Strategy  guidance  aspects  of  such  systems  need  the  ability  to  deal  with  performance 
or  complexity  measures,  and  this  in  turn  relies  on  symbol  manipulation  tools  like  those  found  in 
Macsyma,  for  instance.  Other  activities  such  as  verification,  simulation,  layout  optimization, 
and  maintenance,  also  benefit  from  tools  of  this  flavor. 

Our  main  intent  here  is  to  emphasize  the  somewhat  central  role  of  this  class  of  tools,  should 
they  exi&t,  rather  than  to  suggest  an  exclusive  set  of  tools  and  paradigms  A  consequence  of  this 
observation  is  that  while  there  are  diverse  tasks  involved  in  the  overall  system  design  process, 
there  seems  to  be  a  set  of  tools  that  is  frequently  used  by  a  fairly  broad  spectrum  of  these 
activities.1  Of  course,  this  observation  is  not  entirely  new,  but  given  the  emerging  programming 
paradigms  that  merge  object  based  and  logic  programming, [2]  as  well  as  the  interest  in  using 
similar  techniques  in  the  hardware  domain, [7]  we  believe  that  the  existing  commonalities  are 
much  broader  than  was  perhaps  originally  recognized. 

In  view  of  this  observation,  we  will  next  comment  on  the  potential  for  parallelism  in  such  a 
context,  and  attempt  to  provide  a  basis  for  discussing  the  design  of  hardware  accelerators  for 
such  tools. 


3.  Parallelism  in  the  Implementation 

The  overall  parallelism  in  a  design  environment  can  stem  from  essentially  three  main  sources, 
since  we  can  consider 

1.  The  specific  tools  in  this  kernel  is  expandable,  and  will  depend  upon  the  paradigm  adopted  for  addressing  each  of 
these  activities. 


•  Parallelism  in  t  he  system  »ii\  ir*>n men 1 1  - 1  r  < i •  t  ■  i r •* 

•  Parallelism  in  the  underlying  t < •* >1  structure 

•  Parallelism  in  the  language  support  structure  i  e  'he  imj  si  e  -ii  -f  'he  !hiciik<  ‘ 

tools  are  written  in 

In  discussing  hardware  support  for  such  parallelism  will  fur  smipli.it  \  '.n«id<-r  tr  <t 

classes  of  such  parallelism  that  are  achievable  via 

.  tightly  coupled  processes  (shared  niemorv  architect ure« |  slid  w  » 

•  loosely  coupled  processes  (message-passing-based  »r<  lute,  t  ur*-*  ' 

We  first  comment  on  the  potential  parallelism  it  the  l«\>-l  ,f  ib-ir  i.  •  j.r. t  h  o  in . 

the  problem  st  rue  t  lire  We  then  look  at  the  at  t  ribut  e«  of  t  In-  »l-t  r  i.  t  par  t  it  ton  mg  m  1  <  r  in  ■  l 
possible  implementations  using  the  two  dominant  paradigm-  f-.r  building  hardwire  ti  int'  n 

shared  memory  and  message-based  architectures 

3.1  Parallelism  in  the  System  Structure 

We  first  consider  the  parallelism  in  the  structure  •■(  a  design  environment  Each  of  the 
activities  in  the  "activity"  diagram  may  be  executed  "concurrent ly ''  on  a  loosely  coupled  system 
In  fact,  this  can  be  done  in  several  cases  in  a  pipelined  or  "systolic"  sense  because  of  tlo 
input  output  dependencies  involved  Some  of  the  activities  themselves  may  be  parallelized  e  g 
simulation  of  large  networks  Since  a  large  number  of  activities  share  a  common  set  of  tools 
additional  processing  power  is  exploitable  via  duplication  of  tools  The  barm-  parallelism 

available  at  this  level  is  that  exploitable  by  multiple  users  timesharing  hard*  -re  accelerators 
that  support  a  common  kernel  of  tools 

3.2  Parallelism  in  the  underlying  tool  structure 

To  investigate  parallelism  in  the  underlying  tool  structure,  it  is  obviously  necessary  to  consider 
each  tool  individually  As  an  example,  we  consider  here  an  important  but  somewhat  non¬ 
standard  tool:  clause  based  automated  reasoning  systems,  and  languages  based  on  logic, 
functions,  and  equations 

The  rationale  for  this  choice  is  the  following 

•  "Rule  baaed-Expert  Systems"  are  becoming  a  popular  programming  paradigm,  logic 
programming  provides  a  cohesive  formal  basis  for  such  systems 

•  A  number  of  research  efforts  are  attempting  to  enrich  the  logic  programming  framework 
with  the  abilities  to  do  knowledge  based  reasoning.  Except  for  theological  differences,  such 
systems  show  signs  of  soon  becoming  competitive  with  Lisp-based  environments 

•  An  orthogonal  set  of  research  efforts  is  aimed  at  integrating  the  notions  of  logic,  functional, 
equational,  and  object-based  programming. [2,  dj 

•  "Common  sense  reasoning"  techniques  need  to  be  supported  by  such  tools. 

•  Clause  based  systems  have  been  receiving  special  attention  because  of  the  success  of  Prolog 

•  A  number  of  the  tools  e  g  rewrite  systems,  simplifiers,  complexity  calculation.-,  etc  can 
potentially  be  expressed  relatively  easily  using  such  systems. 

Of  course,  there  are  a  number  of  technical  issues  to  be  resolved  in  many  of  these  areas.  The 
intent  here  is  not  to  suggest  that  everything  is  either  simple  or  has  been  solved,  but  merely  to 
establish  a  conceptual  framework  for  discussing  exploitable  parallelism  in  such  systems 


3  2  1  Ctaate-baied  tfitemi  Most  clause  based  systems  incorporate  a  central  component  similar 
to  that  shown  in  Figure  2  1  The  basic  function  of  each  of  the  components  in  the  diagram  is  as 

follows: 

1  Choose  a  tin  fie  elaate  from  the  knowledge  database.  This  clause  is  used,  along  with  other 
clauses,  to  infer  new  clauses.  The  choice  of  this  clause  is  usually  governed  by  some 
criterion  that  determines  how  relevant  or  “promising"  this  piece  of  information  is. 

2  Generate  new  elautea  using  the  given  clause,  a  variety  of  inference  rules,  and  (optionally) 
other  clauses  in  the  knowledge  base 

3  Simplify  Inferred  Claatet  involves  the  reduction  of  the  inferred  clauses  to  a  normal  form. 
This  process  often  requires  a  fair  amount  of  computation,  with  a  minimal  amount  of 
communication  involved. 

4  A  Generahtp  Cheek  is  made  to  check  whether  the  newly  inferred  clauses  already  exist  in 

the  knowledge  base 

3  One  the  new  clause  is  deemed  useful,  it  needs  to  be  integrated  into  the  knowledge  base 
data  structures 


KNOWLEDGE  DATABASE 


Figure  2.  Subprocesses  in  a  clause-based  theorem  proving  system 


In  such  a  system,  the  simplification  process  can  usefully  be  run  on  a  separate  processor  (it  is 
compute  bound,  and  requires  a  low  bandwidth  of  communication  with  the  knowledge  base).  All 
of  the  other  processes  typically  interact  heavily  with  the  knowledge  base.  The  basic  requests 
that  arise  are  of  the  form  that  require  the  set  of  formulas  that  satisfy  a  boolean  conjunct,  and 
that  can  be  unified  with  a  specified  formula.  The  degree  of  useful  parallelism  (loosely  coupled) 
is  determined  by  how  expeditiously  the  data  base  accesses  can  be  resolved. 

An  efficient  access  technique,  indicated  in  Figure  3,  is  to  partition  the  clausal  knowledge  base 
into  roughly  two  parts: 

1.  The  raw  data  is  represented  as  a  very  compact  list  structure,  in  which  each  formula  occurs 
just  once,  with  “pointers"  relating  it  to  each  occurrence  in  some  superstructure. 

2.  A  second  structure  is  constructed  to  offer  rapid  access  to  raw  data.  It  is  a  set  of  inversions 
based  on  distinct  properties  of  the  formulas,  allowing  rapid  isolation  of  the  desired 

formulas. 

An  important  characteristic  of  this  representation  is  that  the  access  structures  can  be  easily 
partitioned  into  an  arbitrarily  large  set  of  substructures.  The  set  of  formulas  can  be  partitioned 
into  an  arbitrarily  large  number  of  subgroups,  with  each  group  accessed  through  a  different 
access  structure.  Unfortunately,  the  core  data  does  not  seem  amenable  to  partitioning  as  easily. 


The  clause-baaed  component  of  a  design  environment  may  therefore  be  viewed  as  a  number  of 
such  pipes  running  in  parallel,  all  interacting  with  the  shared  knowledge  base  and  coordinated 
by  a  single  resource  manager.  The  possible  response  time  is  then  restricted  by  the  response 
time  of  the  knowledge  base.  The  hardware  support  appropriate  for  the  process  level 
partitioning  described  above  would  consist  of  a  loosely  coupled  hardware  engine  for  performing 
the  simplifications,  together  with  a  tightly  coupled  shared-memory  system  that  supports  the 


Figure:  Partitioning  access  structures  to  the  knowledge  base. 

3.3  Other  sources  of  Parallelism 

As  indicated  above,  many  of  the  tools  in  the  design  environment  may  be  amenable  to 
parallellization;  however,  each  must  be  examined  on  an  individual  basis.  Currently,  research 
along  such  lines  is  in  its  very  early  stages,  with  the  greatest  emphasis  being  placed  on  hardware 
accelerators  for  circuit  simulation [6],  There  are  several  ongoing  research  efforts  that  attempt  to 
exploit  parallelism  in  circuit  simulation  at  various  levels  of  detail,  for  example,  by  partitioning 
circuits  across  multiprocessor  clusters. In  addition,  there  are  an  increasing  number  of  commercial 
products  that  are  now  available  that  address  the  simulation  task.  Additionally,  hardware 
accelerators  for  routing  are  also  being  studied  as  a  means  of  improving  the  performance  of  VLSI 
design  tools,  e.g.,  (5] 

3.4  Parallelism  in  the  Underlying  Language  Implementation. 

A  further  source  of  parallelism  in  the  design  environment  is  parallel  support  for  the  languages 
that  are  used  to  implement  the  tools.  This  particular  task  clearly  has  a  very  broader  appeal, 
and  consequently  there  are  a  number  of  efforts  aimed  at  providing  multiprocessor 
implementations  for  functional  and  logic  languages,  as  well  as  for  Algol-  and  Lisp-like  languages. 
In  this  context,  functional  languages,  in  expressing  computations  as  the  evaluation  of 


expressions  such  as  f  (tl,  ...  tn)  allow  for  a  very  natural  expression  of  the  concurrency 
that  is  possible  in  an  execution:  tl,  ...  ,tn  can  all  be  executed  in  parallel.  This  observation 
extends  to  many  of  the  symbolic  manipulation  operations  involved  in  logic  and  algebra  based 
systems.  Consequently,  this  is  an  area  where  any  progress  will  have  significant  impact. 

4.  Summary 

This  paper  has  highlighted  the  fact  that  a  surprisingly  large  number  of  software  and  hardware 
design  activities  can,  given  appropriate  paradigms,  benefit  from  a  common  kernel  of  tools  The 
nature  of  these  tools  was  identified,  and  different  sources  of  parallelism  in  the  implementation 
of  design  environments  for  software/hardware  systems  were  examined  We  envision  that  future 
generation  design  environments  will  have  increasing  degrees  of  hardware  support  for  such  a 
kernel  of  design  tools,  as  well  as  the  underlying  implementation  '  nguages  The  identification  of 
the  components  in  such  commonly  used  tool  sets  will  help  bett  ammortize  the  development 
costs  involved  in  building  special  purpose  hardware  accelerators  for  such  tools  The  existence  of 
such  advanced  design  environments  will,  in  turn,  serve  to  improve  the  cost-effectiveness  and 
robustness  of  systems  designed  using  these  tools. 
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I.  Introduction 

Broadcasting  presents  a  powerful  technique  for  parallel/distributed  computing  As  of  now  it 
remains  a  largely  unexplored  territory  especially  in  terms  of  the  impact  it  may  have  in  all  areas 
of  parallel  processing.  In  this  context  we  summarise  our  experiences  with  broadcasting  in  the 
several  multiprocessors  built  over  the  last  few  years  at  our  research  laboratory.  It  has  been 
found  that  broadcasting  can  play  an  important  role  not  only  in  terms  of  improving  algorithmic 
efficiency  but  also  with  respect  to  making  the  parallel  programming  environment  more 
powerful.  In  order  to  substantiate  this  conclusion  we  present  the  role  of  broadcasting  in  the 
design  and  implementation  of  parallel  algorithms  (e.g.  prefix  computation  )  and  in  the 
implementation  of  the  concurrent  programming  language  Linda. 

II.  The  Model  of  Broadcasting  &  Previous  Research 

In  the  weakest  form  of  broadcasting  one  processor  transmits  one  message  on  the  bus  per  unit 
time.  During  this  time  all  the  other  processors  listen  to  the  broadcast.  Translating  this  concept 
into  that  of  a  shared  memory  model  this  implies  a  capability  on  the  part  of  processors  to  read 
any  one  memory  cell  simultaneouslyJ‘1  In  the  parallel  circuit  model  this  would  permit  processing 
nodes  to  have  unbounded  fan-out  -  a  capability  which  results  in  reduced  complexity  of  parallel 
computation  for  some  problems  like  prefix  computation.^  Even  though  capabilities  of 
broadcasting  have  relatively  been  ignored  ,n  literature  some  efficient  algorithms  (e  g  rinding  an 
extremum,  sorting,  selection,  merging)  based  on  the  mechanism  of  broadcasting  have  been 
proposed J1*'8!  Here  we  present  our  experience  with  broadcasting  as  a  useful  medium  for 
implementing  a  wider  class  of  algorithms  (e.g.  prefix  computation)  and  the  concurrent 
programming  language  Linda.  The  multiprocessor  considered  here  is  the  S/NET  J4' 

The  discussion  will  also  attempt  to  bring  out  the  different  variations  of  broadcasting  that 
are  possible.  For  our  purposes  let  the  channel  be  slotted  into  slots  of  time  Tg  Each  of  these 
slots  any  one  of  these  processors  may  execute  a  broadcast.  Thus  if  the  bus  connecting  N 
processors  is  N  times  as  fast  as  the  processor  cycle  time  Tp  it  has  the  potentiality  of  letting 
each  processor  broadcast  one  message  during  one  processor  cycle  time  Tp  ■*  N*Tg.  This  infact 
describes  the  S/NET  bus.  In  the  ensuing  discussion  the  questions  of  synchronisation  are  dealt 
with  at  the  algorithmic  level  -  that  is,  the  algorithms  considered  are  synchronous  in  nature.  The 
issue  of  bus  contention  is  presumed  to  be  decided  at  the  hardware  level  as  in  S/NET 

III.  Applications 

As  mentioned  before  the  impact  of  the  broadcasting  mechanism  are  fairly  wide  spread  -  from 
specific  computational  problems  to  implementation  of  parallel  programming  languages  like 
Linda.  In  this  section  some  of  these  applications  are  summarized  while  the  others  are 
mentioned. 

A.  Prefix  Computation 

Given  N  inputs  *  1,*2,...,xw_1,*/v  and  an  arbitrary  associative  operation  o  the  problem  of 
prefix  computation  is  to  produce  the  N  outputs  *  o ..ox.,  where  1  <  »  <  N  This  problem 
appears  in  hidden  forms  in  many  commonly  occuring  computations,  for  example,  evaluation  of 


linear  recurrences,  carry  computation  in  binary  adder  etc  ■  If  *r  have  V  processor*  with  one 
input  in  each  processor  it  may  be  deduced  that  time  of  computation  is  (  Tp  *  7  )*.V  which  is 
of  the  same  order  of  complexity  as  that  for  a  sequential  processor  If  however  we  have  onK  A 
processors  the  time  taken  can  be  reduced  to  7^,*(.Y/A'  -e  A  )  +  T  *A  The  algorithm  is  a* 
follows  Each  processor  computes  the  the  prefix  problem  on  its  ,Y/A  inputs  The  last  prefix 
from  processor  i  is  broadcast  which  is  used  by  processor  ul  to  compute  immediately  its  last 
prefix  which  is  then  broadcast  While  the  values  of  the  last  prefix  are  being  communicated 
between  processors  i  and  i+  1  the  processors  1  i—  I  are  busy  calculating  the  lower  order 
prefixes.  The  algorithm  assumes  that  the  inputs  are  distributed  in  a  prescribed  manner  - 
processor  i  contains  the  inputs  xu.  ,,r„  ...  Thus  the  best  time  for  the  case  7„  <  7„ 

.  .  -r-;  &{*—  IJ/IV  ♦  I  Ph/A  t  —  r 

is  achieved  when  a*v  A 

Interestingly,  however  a  simple  variant  of  this  algorithm  works  on  any  distribution  of  data 
To  see  how  this  algorithm  works  assume  any  data  distribution  of  j(,  ,  x  among  the  A 

processors  available  Also  for  convenience  describe  r  o  or  as  ji.jj  In  this  notation  then,  given 
l * , •  ] ,  1<‘<N  we  are  asked  to  compute  all  [l,ij.  By  max  |i,j]  we  mean  xo  ox  ,  such  that  k  = 
maximum  j  for  which  z  o  ox*  is  available  or  can  be  formed  from  the  locally  available  values  of 
data.  The  sketch  of  the  algorithm  which  runs  on  each  processor  is  as  follows 

{(  If  processor  i  contains  [»,»],  [yj’j.ji.kj.  then 

it  is  also  responsible  for  calculating  All 

processors  execute  the  same  algorithm  ) 

while  there  exists  a  unused  data  value  do 

pick  an  unused  [i,i]  with  least  index  i 
compute  max  (»,/) 
end  of  while 

Broadcast  max  [l,i] 

while  an  |l,i]  remains  to  be  computed  do 
receive  [l,/] 

compute  max  [I ,k),k>j 
broadcast  max  [l,k| 

compute  all  computable  \l,m),j  <m  <k 
end  while  } 

Infact  stronger  properties  about  the  performance  of  the  algorithm  can  be  shown  If  we  define 
the  parallel  time  of  computation  such  that  it  includes  the  time  taken  to  redistribute  the  data  as 
well  as  the  actual  computation  time  then  it  may  be  proved  that  the  algorithm  acheives  optimal 
speedup  for  a  large  class  of  data  distribution.  Thus  the  gains  from  having  a  desirable  data 
distribution  may  be  offset  by  the  time  taken  to  achieve  that  distribution.  It  may  also  be  noted 
that  feature  of  broadcasting  is  instrumental  in  making  this  algorithm  work  for  all  data 
distribution. 

B.  Linda  Implementation 

The  concurrent  programming  language  Linda  was  designed  at  Yale  University  and 
implemented  on  the  S/NET  by  Carriero  et.  alJ6^  Linda  differs  from  other  parallel  languages  by 
adopting  the  generative  communication  model.  A  shared  tuple  space  (TS)  is  accessible  to  all  the 
processes  in  the  distributed  program  and  two  processes  may  communicate  only  by  reading  from 
or  writing  into  this  shared  tuple  space  (Figure  1).  The  broadcast  capability  provides  a 
convenient  underlying  communication  mechanism  to  implement  the  language.  There  are  four 
operations  defined  over  TS:  ou<(),in(),r<a</()andet’a/().  o«f(«)  adds  a  tuple  t  to  the  TS.  tn(«) 
withdraws  a  matching  tuple  t  from  TS,  read(s)  assigns  actuals  to  the  formals  as  in  in(s)  except 


that  the  tuple  is  not  removed  from  TS.  eval(s)  adds  an  unevaluated  tuple  to  TS.  This  design 
feature  of  the  language  offers  enhanced  efficiency  in  a  parallel  programming  environment  where 
data  objects  may  have  to  be  shared.  Using  the  facility  of  broadcasting  this  TS  is  implemented  in 
a  simple  manner  as  follows  The  tuple  space  is  duplicated  on  all  processors.  Whenever  an 
in(),  out()etc  is  executed  the  corresponding  template  is  broadcast  and  a  similar  action  is  taken 
by  all  the  nodes.  In  this  way  the  synchronization  (and  thus  the  consistency  of  TS  in  all 
processors)  is  effectively  realized  A  second  implementation  in  which  the  the  tuples  remain  on 
the  originating  nodes  until  they  are  explicitly  asked  for  is  currently  being  implemented^.  The 
functions  tn(),oiif()  etc.  are  then  realized  by  broadcasting  the  templates  -  the  corresponding 
action  of  insertion  or  deletion  from  the  TS  is  taken  by  the  processor  which  contains  the 
matching  tuple. 

IV.  Conclusion  &  Research  Issues 

It  should  be  clear  that  broadcasting  has  many  innovative  uses  which  should  be  further 
explored.  These  uses  can  be  classified  into  fundamental  (that  is,  algorithmic)  and 
implementationaJ.  For  example,  the  use  of  broadcasting  in  implementation  of  multiprocessor 
operating  systems  offers  some  advantages. 

Another  issue  of  significant  interest  .'  the  relative  advantages  and  disadvantages  which 
multiprocessors  with  different  Tg&Tp  have  to  offer.  For  example,  broadcasting  on  a 
hypercubeM  of  N  nodes  takes  time  logN.  This  may  be  contrasted  with  the  fast  speed  of  the 
S/NET  bus  in  which  the  time  taken  per  broadcast  is  l/N  times  the  processor  speed.  But  again 
if  we  conceive  of  multiple  broadcasts  taking  place  simultaneously  -  as  in  the  hypercube, 
interesting  applications  arise.  Another  related  issue  is  the  use  of  broadcasting  in  absence  of 
synchronization  among  the  messages  belonging  to  a  single  broadcast. 

In  summary,  broadcasting  offers  interesting  possibilities  not  only  in  the  execution  of  specific 
algorithms  but  also  ’n  the  implmentation  of  general  parallel  programming  software  and  as  such 
will  play  an  important  role  in  future  multiprocessing  systems. 
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INTRODUCTION 

Efficient  use  of  extensive  parallelism  with  a  wide  range  of  algorithms  is  required 
to  meet  future  computation  demands.  The  problems  are  discussed  of  providing: 
high  performance,  flexibility,  extendability.  reliability  and  ease  of  programming. 
An  optical  crossbar  interconnected  processor  is  described,  including  the  associated 
optics,  and  programming  methodology.  The  Levinson-Durbin  algorithm  for  solving 
Toeplitz  matrix  equations  is  considered  as  an  illustration  of  how  2-D  autoregressive 
models  could  be  computed  for  image  texture  classification.  An  executable  flow 
graph  is  generated  for  the  Levinson-Durbin  algorithm  with  unrolled  iterations.  The 
flow  graph  is  implemented  on  a  simulator  and  the  resulting  activity  chart  shows  a 
100’ 7  efficiency  once  the  pipeline  is  filled.  The  advantages  and  limitations  of  the 
proposed  architecture  are  summarized  in  the  conclusions. 

DIFFICULTY  OF  ACHIEVING  DESIRED  FEATURES. 

Extendability  implies  that  more  processors  may  be  added  to  the  multiprocessor 
together  with  corresponding  interconnections  without  requiring  new  software  and 
with  performance  approaching  linear  improvements  with  number  of  processors.  The 
number  of  parallel  processors  that  may  be  used  efficiently  is  limited  in  today’s 
prototype  and  proposed  systems  by  the  communication  delay  and  interconnection 
complexity.  These  systems  are  not  generally  very  extendable. 

Flexibility  to  efficiently  run  a  wide  range  of  algorithms  may  be  achieved  by  recon¬ 
figurability.  Reconfigurability  also  enables  software  to  readily  adjust  to  an  extended 
system.  However,  reconfigurability  introduces  large  delays  and  high  control  over¬ 
head  in  most  proposed  systems.  This  severely  restricts  the  number  of  processors 
that  may  operate  efficiently. 

Nearest  neighbor  interconnected  configurations  are  easy  to  construct  and  provide 
high  speed  communications.  The  number  of  elements  in  an  array  may  be  increased 
without  altering  the  bandwidth  at  an  input  or  output  connection.  This  permit- 
large  numbers  of  processors  to  be  efficiently  used  in  parallel  for  suitable  algorithms. 
However,  the  nearest  neighbor  connection-  limit  the  flexibility  or  range  of  algo¬ 
rithms  that  may  be  implemented  efficiently.  For  example,  many  fast  algorithms  use 


recursive  doubling  which  results  in  complex  non  nearest  neighbor  communication-, 
e.g.  FFT.  Systolic  svsrems  prearrange  dataflow  so  that  input  and  output  occur- 
at  the  edges  of  the  processor  array  at  each  cycle  2  .  Latency  is  increased  rela¬ 
tive  to  serial  machines  and  this  causes  difficulties  for  general  purpose  or  adaptive 
computing. 

Reliability  is  often  accomplished  by  means  of  redundancy  in  software,  hardware, 
time  and/or  space.  This  is  detrimental  to  satisfying  performance  for  a  given  cost. 
Extended  systems  are  likely  to  be  less  reliable  because  of  the  greater  possibility  of 
interference. 

Ease  of  programming  is  important.  Mapping  a  complicated  algorithm  flow  graph 
to  a  complicated  machine  flow  graph  is  difficult.  The  ability  to  perform  the  mapping 
automatically  is  enhanced  if  the  machine  flow  graph  is  less  constrained,  as  for  a 
crossbar  interconnected  system. 

PROPOSED  OPTICAL  CROSSBAR  INTERCONNECTED 

PROCESSOR 

System 

Figure  1  shows  a  preliminary  organizational  structure  for  an  optical  crossbar 
signal  processor  4.3.  The  inputs  and  outputs  of  hundreds  of  processing  elements 
are  connected  to  an  optical  switch  by  means  of  commercially  available  fiber  optic 
links  of  bandwidth  160  MHz  or  more.  The  processors  perform  elementary  oper¬ 
ations  such  as  multiply  or  add  and  therefore  have  two  input  connections  for  the 
two  operands.  This  fine  granularity  permits  the  maximum  amount  of  parallelism 
to  be  extracted  from  algorithms.  The  processing  element  output  is  converted  from 
parallel  to  serial  in  a  shift  register  for  driving  the  fiber  optic  link.  A  second  fiber 
optic  loop  between  processors  and  main  memory  banks  provides  input  output. 

Optical  switch  and  interconnections 

Optics  is  used  for  interconnections  because,  relative  to  electronics,  it  provides 
high  levels  of  parallelism,  high  bandwidth,  large  fan-in  and  fan-out.  and  high  im¬ 
munity  to  interference.  These  features  enable  the  use  of  fine  granularity  and  recon¬ 
figurability.  Each  intersection  in  a  crossba'  witch.  Fig.  2a.  has  a  switch  permitting 
a  horizontal  input  line  to  be  coupled  with  ertical  output  one.  Figure  2b  shows  a 
diagrammatic  crossbar  switch  implement<  vith  a  spatial  light  modulator  (SLM) 
and  dors  indicate  transparent  regions  coi  i-fent  with  the  closed  switch  setting- 
marked  by  dots  in  Figure  2a.  An  optical  tens  system  is  used  to  spread  the  light 
from  the  input  sources  horizontally  without  spreading  the  light  vertically.  Light 
passing  through  the  spatial  light  modulator  is  collapsed  on  to  receiving  diodes  by 
means  of  a  lens  system  which  focusses  vertically  without  spreading  horizontally. 


A  reflecting  membrane  covers  the  surface  of  an  array  of  transistors  on  a  sili¬ 
con  chip.  Fig.  3.  to  form  a  deformable  mirror  device  (DMD).  The  right  side  of  the 
modulator  is  folded  back  and  a  beam  splitter  used  because  DMD's  are  reflective 
SLM's.  Activation  of  a  transistor  causes  the  membrane  to  dip  above  the  transistor. 
A  Schlieren  system  accounts  for  the  reflections  from  the  regions  between  mirror  de¬ 
flection  pixels.  Imaging  and  performing  spectral  analysis  with  a  Texas  Instrument's 
DMD  of  size  128  by  128  has  been  published  '5  .  It  takes  one  microsecond  per  row 
to  set  the  DMD  up  and  a  few  microseconds  to  switch  to  the  new  settings. 

Software  and  programming  approach 

The  flow  of  data  is  prearranged  so  as  to  minimize  run  time  overhead.  A  compiler 
maps  an  algorithm  graph  into  the  processor:  assigning  the  nodes  in  the  graph  to 
processing  elements  and  the  edges  or  links  to  crossbar  settings.  Data  flowing  into 
the  switch  is  routed  to  the  appropriate  processor.  A  processor  will  perform  the 
operation  for  which  it  is  programmed  on  the  next  clock  cycle  after  receiving  its 
operands.  The  output  is  routed  via  the  switch  to  the  next  processor. 


EXAMPLE  OF  PROGRAMMING  AND  PERFORMANCE 

EVALUATION 


Image  texture  classification  example 

The  idea  is  to  compute,  for  each  segment  of  an  image,  a  2-D  AR  model  or  filter 
that  when  applied  to  the  2-D  image  segment  will  remove  all  the  information  leaving 
only  white  noise.  The  2-D  filter  may  then  be  considered  an  estimator  of  texture. 
It  may  be  used  to  correlate  against  templates  for  the  problem  domain,  e.g.  trees 
and  roads.  2-D  dynamic  programming  may  be  appropriate  to  allow  for  stretch  or 
compression.  Wiener  or  least  square  filter  theory  1  may  be  used  to  derive  a  set  of 
block  Toeplitz  matrix  equations  whose  solution  are  the  required  filter. 

Yule- Walker  equations  and  Levinson-Durbin  algorithm 

In  order  to  simplify  the  discussion  the  1-D  case  is  considered.  The  implementa¬ 
tion  of  the  2-D  case  on  the  proposed  processor  is  thought  to  be  a  straight  forward 
extension.  The  Yule- Walker  equations  may  be  solved  for  I-D  AR  parameters  a  from 
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The  Levinson-Durbin  algorithm  follows.  At  the  nth  iteratior  a  reflection  coeffi¬ 
cient  is  computed  as  the  inner  product 
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The  power  of  the  white  noise  associated  with  the  AR  process  is  computed  from 

v(n)  —  v{n  -  1)(1  -  ic(r?)  2).  (4) 

Minimum  delay  is  maintained  by  updating  the  AR  parameters  from 

«k(v  -  1)  —  c^an  ki”  1)  k  =  0  to  n.  (5) 


The  autocorrelation  function  may  be  computed  by  reconfiguring  the  crossbar 
and  processors  to  represent  a  tree  4  . 

Flow  graph  for  unrolled  Levinson-Durbin  algorithm 

The  processing  of  many  segments  of  an  image  provides  continous  sets  of  data 
for  which  the  algorithms  must  be  performed,  therefore  pipelining  is  advantageous. 
The  Levinson-Durbin  iterations  are  unrolled  into  a  long  section  of  code.  Only  four 
loops  are  unrolled  to  simplify  the  explanation. 

In  the  flow  graph.  Figure  4.  each  operation  is  assigned  to  a  node  of  the  graph 
and  the  flow  of  data  between  operations  is  identified  by  links  between  nodes.  Nodes 
marked  with  subtraction  imply  the  subtraction  of  the  right  hand  input  from  the  left 
hand  input.  The  triangular  arrows  indicate  negation  or  unary  minus  which  may  be 
accomplished  at  the  input  to  the  appropriate  node  rather  than  with  an  extra  node. 
Identity  instructions  marked  "Idem"  support  the  fanout  of  operands  so  that  an 
input  connects  to  a  processing  element  before  going  through  the  crossbar.  Delays 
are  inserted  as  numbers  on  the  graph  edges. 

Performance  of  Levinson-Durbin’s  algorithm 

Figure  5  shows  an  activity  chart  resulting  from  running  a  dataflow  simulator 
for  the  How  graph  of  figure  4.  Processors  1  and  2  operate  in  parallel  in  the  first 
Levinson-Durbin  computation.  In  clock  2  the  processor  node  3  now  operates  on 
this  computation.  Meanwhile  processors  1  and  2  are  free  to  start  on  the  next 
Levinson-Durbin  computation.  After  11  clock  cycles  the  pipeline  is  full,  and  the 
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result  for  the  first  Levinson-Durbm  computation  is  obtained.  Once  the  pipeline  is 
filled,  the  efficiency  of  computation  is  KM)1/.  It  is  reasonable  to  unroll  up  to  10 
iterations  of  Levinson-Durbin  for  operation  with  the  500  processor  system  and  still 
have  sufficient  processors  for  the  other  computations  necessary. 

Conclusion 

Advantages  of  proposed  processor.  Fine  granularity  processing  elements  permit  the 
efficient  use  of  high  levels  of  parallelism.  Flexibility  to  run  a  wide  range  of  algo¬ 
rithms  with  ease  is  achieved  by  reconfigurability.  The  number  of  wires  connecting  to 
an  electronic  crossbar  switch  of  equal  throughput  to  the  proposed  optical  crossbar 
would  be  tremendous  because  parallel  connections  would  be  required.  Optics  has 
the  bandwidth  to  permit  serial  connections.  The  processor  may  be  doubled  in  size 
by  adjoining  a  similar  system  and  adding  exchange  switches.  Programming  may  be 
accomplished  simply  by  writing  flow  graphs  and  using  a  compiler  to  map  the  flow 
graphs  to  the  machine.  An  image  texture  classification  example  was  discussed  to  il¬ 
lustrate  the  programming  procedure.  Reconfigurability  and  optical  interconnections 
improve  reliability. 

Limitations  of  proposed  processor.  The  desirable  features  listed  above  are  achieved 
by  having  a  more  costly  interconnection  system  than,  for  example,  a  nearest  neigh¬ 
bor  scheme.  Consequently,  the  proposed  system  is  limited  to  high  performance 
systems  where  flexibility,  extendability  and  reliability  are  required.  The  large  fan¬ 
out  possible  with  optics  by  means  of  lenses  is  accompanied  by  an  energy  loss  which 
increases  the  cost  of  the  amplifiers  required.  The  times  to  load  and  switch  the  cross¬ 
bar  switch  constrain  the  rate  at  which  algorithms  may  be  changed  in  a  dataflow 
mode. 
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SUMMARY 


The  performance  of  multiprocessor  systems  depends  largely  on  the  performance  of 
the  interconnection  network  (IN)  that  connects  processors  to  memories  or  processors  to 
processors.  In  addition  to  providing  full  connectivity  at  low  cost,  the  IN  must  be  fault 
tolerant  and  be  suitable  for  a  varied  class  of  algorithms  that  the  system  may  execute. 


Although  a  lot  of  emphasis  has  been  placed  lately  on  a  class  of  networks  called  multistage 
interconnection  networks  (MIN’s)  [l],  they  do  not  satisfy  the  later  conditions.  Recently 
there  has  been  an  upsurge  of  interests  in  multiple-bus  systems.  This  system  provides  in¬ 
herent  simplicity  of  the  single-bus  architecture  while  providing  a  large  bandwidth  (BW) 
for  data  transfer.  Also,  it  is  suitable  for  any  kind  of  algorithms  and  has  an  added  advan¬ 


tage  of  fault-tolerance.  This  paper  will  present  the  performance  of  both  centralized  and 
decentralized  multiple-bus  interconnections. 


A  multiprocessor  organization  can  be  broadly  divided  into  two  catagories,  namely  : 
tightly  coupled  and  loosely  coupled.  The  tightly  coupled  multiprocessors  are  characterized 
by  a  close  interaction  between  the  processors  that  is  effected  through  a  bunch  of  global 


memories.  In  loosely  coupled  systems,  the  interactions  are  less  frequent  and  there  may  not 


be  any  global  memories  for  direct  access  at  all.  In  a  multiple-bus  multiprocessor,  there 
are  B  buses  that  connect  N  processors  to  N  memories  in  a  tightly  coupled  system  or  con¬ 
nect  N  processors  in  a  loosely  coupled  system  for  B  <  N.  The  structure  provides  B  alter¬ 
nate  paths  for  communication  and  is  therefore  highly  fault-tolerant.  The  value  of  B  can 
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be  fixed  depending  on  the  communication  requirement  of  the  system.  Based  on  operation, 
the  multiprocessor  network  can  be  divided  into  four  catagories,  namely  : 

Centralized  synchronous, 

Centralized  asynchronous, 

Decentralized  asynchronous  and 
Hybrid. 

Centralized  means  that  there  is  a  central  controller  to  oversee  the  operation  of  the  IN 
and  synchronous  means  the  IN  operates  based  on  a  clock  cycle  so  that  all  the  processors 
submit  their  requests  at  the  begining  of  the  clock.  The  data  transfer  is  also  completed  by 
the  end  of  the  cycle.  A  hybrid  network  may  consist  of  a  combination  of  the  above  three 
types.  In  this  paper  we  will  point  out  some  of  the  previous  work  done  in  the  performance 
analysis  of  various  multiple-bus  INs  and  mention  the  future  scope  of  study. 

A  lot  of  research  has  gone  into  the  performance  analysis  of  centralized  tightly  cou¬ 
pled  multiple-bus  INs  [2-4],  assuming  that  a  processor  sends  a  request  to  any  memory 
module  with  equal  probability.  Analysis  has  also  been  reported  when  a  processor  has  a 
favorite  memory  [5].  As  reported  in  the  above  papers,  the  BVV  linearly  increases  with  the 
increase  in  number  of  buses  until  it  is  saturated  by  the  memory  access  conflicts.  In  cen¬ 
tralized  asynchronous  tightly  coupled  operation,  the  processors  send  the  memory  request 
packets  to  the  central  controller.  The  central  controller  builds  a  queue  for  each  memory 
and  allocates  buses  to  these  memories  in  a  cyclic  fashion.  Analysis  of  this  system  has  been 
reported  in  [6,7].  In  a  decentralized  asynchronous  operation,  we  need  to  follow  some  local 
area  network  protocob  to  capture  the  bus.  Decentralized  protocols  adopted  in  token,  slot¬ 
ted  or  Ethernet  [8]  buses  can  be  applicable.  As  can  be  seen  from  the  current  university 
and  industry  supercomputer  projects,  the  trend  is  to  employ  packet  switching  in  the  INs 
instead  of  circuit  switching.  We  are  in  the  process  of  analysis  and  design  of  suitable 
multiple-bus  protocols  for  tightly  coupled  multiprocessors.  No  work  has  been  reported  yet 


on  hybrid  networks  suitable  for  processor  -  memory  interconnection. 

Analyses  have  been  reported  for  all  the  networks  in  a  loosely  coupled  environment.  A 
loosely  coupled  system  is  similar  in  operation  to  a  local  area  network  except  that  the  in¬ 
teraction  in  the  former  is  at  a  much  lower  level  and  has  very  fine  granularity.  Analysis  of 
a  loosely  coupled  system  is  much  simpler  because  there  is  no  memory  access  conflict  and 
the  processor  does  not  have  to  wait  until  the  current  request  is  satisfied.  Because  the  in¬ 
teraction  is  less  frequent,  techniques  such  as  context  switching  can  be  applied  in  a  loosely 
coupled  multiprocessor.  It  is  usually  assumed  that  a  processor  generates  requests  as  per 
Poisson  process  and  the  network  efficiency  is  measured  in  terms  of  delay-throughput 
characteristic.  Analysis  of  a  synchronous/asynchronous  multiple-bus  network  is  equivalent 
to  M/G/m  queueing  analysis  that  can  be  obtained  from  any  standard  book  [9).  A  decen¬ 
tralized  asynchronous  operation  is  similar  to  the  operation  of  multichannel  local  area  net- 
works  whose  performance  have  started  appearing  recently.  Multichannel  protocols  and 
analyses  for  carrier  sense  multiple  access  /  collision  detection  (CSMA-CD)  have  been  re¬ 
ported  [10, 1 1] .  In  our  opinion,  these  protocols  are  also  well  suited  for  multiprocessor 
operations.  We  have  also  evaluated  the  performance  of  some  hybrid  networks  that  con¬ 
sists  of  a  few  CSMA/CD  buses,  centralized  and  token  buses  [12,13].  The  idea  is  to  divert 
the  traffic  from  the  contention  buses  when  the  level  of  contention  is  high.  As  a  result  we 
obtain  much  better  performance. 

Finally  we  have  thoroughly  evaluated  the  reliability  and  availability  and  our  results 
indicating  that  the  multiple-bus  performs  the  best  [5,14] .  In  conclusion,  we  believe  that 
multiple-bus  offers  a  simple,  flexible,  fault- tolerant  and  easily  expandable  interconnection 
scheme.  Industries  always  prefer  a  shared-bus  connection  [15].  If  communication  is  a  prob¬ 
lem,  let  us  put  another  bus  in  parallel  and  see  what  happens! 
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AUTOMATICALLY  RECON  FIGUR  ABLE  COMPUTER  ARCHITECTURE 


1.0  Automatically  Reconfigurable  Computer 
Architecture 

This  paper  describes  a  proposed  automatically  reconfigurable  cellular  architecture.  The  unique 
feature  of  this  architecture  is  that  the  reconfiguration  control  is  distributed  within  the  system.  There 
is  no  need  for  globed  broadcasting  of  switch  settings.  This  reduces  the  interconnection  complexity 
and  the  length  of  data  paths.  The  system  can  reconfigure  at  the  request  of  the  applications  software 
or  in  response  to  detected  faults.  This  architecture  supports  fault  tolerant  applications  since  the 
reconfiguration  can  be  self-  triggered  from  within.  The  complete  reconfiguration  process  can  pro¬ 
ceed  without  external  interference. 


/./  Introduction 

Recent  advances  in  VLSI  technology  has  made  it  possible  to  interconnect  many  small  com¬ 
puters  together  to  achieve  high  parallelism.  There  are  many  topologies  for  interconnecting 
processors.  The  optimum  choice  for  interconnection  strategy  is  frequently  application  dependent; 
therefore,  most  of  these  architectures  can  be  used  only  for  a  small  number  of  applications.  In  order 
to  support  a  wide  range  of  applications,  it  is  highly  desirable  to  design  and  build  a  general  purpose 
reconfigurable  multiprocessor  system. 

The  demand  for  very  high  performance  computing  systems  has  forced  researchers  to  consider 
non-traditional  architectures,  notably  distributed/ parallel  systems!  10- 14).  This  is  because  the  ex¬ 
ponential  execution  time  required  by  many  algorithms  can  be  significantly  reduced  by  exploiting 
inherent  parallelism.  Most  of  these  new  architectures  attempt  to  match  the  underlying  hardware  to 
specific  problcms(algorithms)  for  fast,  efficient  execution. 

Rung  (1)  has  proposed  such  a  class  of  architectures  known  as  systolic  arrays  that  can  be  used 
to  perform  a  variety  of  highly  parallel  computations  such  as  matrix  multiplication,  fast  Fourier 
transformation,  etc.  Each  processor  in  these  arrays  performs  a  simple  and  short  computation  and 
regularly  pumps  data  in  and  out.  But  only  a  limited  number  of  functions  can  be  performed  with 
each  type  of  interconnection  network.  What  is  needed  is  a  general  purpose  reconfigurable  archi¬ 
tecture  that  allows  many  of  the  special  purpose  architectures  to  be  embedded  in  a  single  structure. 

Considerable  attention  is  being  given  to  such  a  general  purpose  reconfigurable  architecture  in 
recent  literature.  Snyder  [2|  demonstrated  the  feasibility  of  such  an  architecture  with  the  CHIP 
computer  (Configurable,  Highly  Parallel  computer),  which  provides  a  programmable  intercon¬ 
nection  structure  integrated  with  processing  elements.  It  is  designed  to  provide  the  flexibility  needed 
to  compose  general  solutions  while  retaining  the  benefits  of  uniformity  and  locality  that  the  algo¬ 
rithmically  specialized  processors  exploit.  It  consists  of  a  switch  lattice,  in  which  the  switches  are 
set  to  create  the  best  interconnection  network  for  the  function  to  be  executed.  An  external  con¬ 
troller  broadcasts  a  command  to  all  the  switches  to  invoke  the  appropriate  architecture. 

This  approach  has  two  major  disadvantages.  First,  the  setting  of  the  switches  is  controlled 
by  an  external  processor.  This  necessitates  some  type  of  global  connection  to  all  switches.  Sec¬ 
ondly,  the  master  control  circuitry  becomes  a  single  point  failure  site,  since  it  would  be  necessary 
for  the  master  control  to  work  correctly  in  order  for  the  appropriate  switch  setting  to  be  invoked. 
Thus  the  failure  of  the  master  control  will  significantly  degrade  the  system  reliability.  This  is  highly 
undesirable  for  failure  critical  applications  such  as  automatic  landing  of  commercial  aircraft,  control 
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systems  for  nuclear  reactors,  life  suuport  systems  for  medical  applications,  etc.  Even  in  less  critical 
applications,  system  downtime  is  often  very  expensive. 

In  this  paper,  we  propose  a  reconfigurable  cellular  architecture  in  which  the  reconfiguration 
mechanism  is  distributed  throughout  the  array  instead  of  being  a  single-point  failure  problem.  In 
addition,  reconfiguration  does  not  require  an  external  processor  to  compute  the  new  intercon¬ 
nection  pattern. 


1.2  Proposed  Structure 

Our  desire  is  to  be  able  to  implement  a  set  of  specific  architectures  designed  to  solve  a  specified 
set  of  parallel  computations  in  a  single  reconfigurable  system.  Cellular  arrays  are  a  viable  compu¬ 
tational  architecture  for  such  an  implementation.  A  cellular  (iterative)  array  is  a  collection  of  iden¬ 
tical  cells  that  are  interconnected  in  a  uniform,  or  regular,  fashion.  A  cellular  array  is  proposed  for 
the  following  reasons.  First,  they  are  of  highly  parallel  nature.  Secondly,  since  all  the  cells  in  the 
array  are  identical  the  architecture  becomes  easily  expandable  without  changing  the  current  hard¬ 
ware  in  any  significant  way.  Lastly,  each  processor  is  connected  to  other  processors  according  to  a 
regular  interconnection  pattern.  By  using  regular  local  interconnection  patterns  we  can  avoid  the 
use  of  global  connections,  so  that  the  interconnection  complexity  will  not  increase  with  the  size  of 
the  system.  By  making  the  control  distributed  throughout  the  array,  the  "hard  core'  component 
will  be  minimal.  Being  a  planar,  regular  structure,  such  a  parallel  processor  is  well  suited  for  VLSI 
implementation. 

The  proposed  cellular  structure  is  composed  of  two  cellular  arrays  that  are  interconnected  as 
shown  in  Figure  1  on  page  3.  The  cellular  structure  consists  of  a  'control  hyperplane'  and  a 
'computation  hypcrplane',  where  for  each  cell  in  the  computation  hypcrplane  there  is  an  associated 
cell  in  the  control  hyperplane.  Each  cell  in  the  computation  hyperplanc  can  be  either  a  switch  or 
a  processing  element,  as  shown  in  Figure  2  on  page  4.  If  it  is  a  processing  element  it  must  be 
complex  enough  to  realize  the  functions  required  by  each  of  the  algorithms,  i.c.,  each  processing 
element  is  some  type  of  universal  logic  module,  or  microprocessor,  that  can  perform  a  list  of  func¬ 
tions. 

Each  cell  in  the  computation  plane  is  controlled  by  the  state  of  the  corresponding  cpll  in  the 
control  plane.  If  the  cell  in  the  computation  plane  is  a  processing  clement,  then  the  control  cell 
specifies  a  particular  algorithm  from  a  set  of  possible  algorithms  that  the  processing  element  can 
implement.  If  the  cell  in  the  computation  plane  is  a  switching  clement,  then  the  control  cell  will 
specify  a  particular  local  interconnection  of  the  switcliing  element  to  its  neighbors.  The  overall 
function  to  be  performed  by  the  cellular  structure  is  defined  by  the  global  pattern  of  control  states. 

To  create  the  desired  configuration  for  a  particular  computation,  one  cell  in  the  array  is  ini¬ 
tialized  to  a  'seed'  state  which  defines  the  global  computational  task  to  be  performed  by  the  array. 
The  cellular  array  will  then  utilize  that  information  to  'grow'  the  required  pattern  of  states  in  the 
control  hypcrplane.  The  pattern  of  states  in  the  control  hypcrplane  invokes  the  desired  intercon¬ 
nection  structure  in  the  computation  plane. 


1.3  Theory  of  Cellular  Reconfiguration 

The  control  hyperplane  is  responsible  for  assigning  proper  functions  to  the  cells  in  the  com¬ 
putational  hyperplane.  Any  arbitrary  cell  in  the  control  hyperplane  will  eventually  receive  infor¬ 
mation  about  the  global  function  to  be  implemented  from  the  'seed'  state  initially  planted  at  an 
arbitrary  location.  The  way  in  which  the  given  information  is  distributed  throughout  the  control 
plane  to  produce  a  desired  final  pattern  is  explained  in  this  section.  This  process  will  be  referred 
to  as  'growth'. 

The  control  hyperplane  is  an  array  of  identical  cells  interconnected  in  a  uniform  fashion,  where 
each  such  cell  can  receive  state  information  only  from  its  neighbors.  The  cell  which  initially  receives 
information  about  the  global  function  will  communicate  with  its  neighbors  and  gradually  spread  the 
information  through  the  array  to  create  the  final  desired  pattern.  This  section  explains  the  "growth' 
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process,  the  way  in  which  the  given  information  is  transmitted  throughout  the  control  plane  to 
produce  a  desired  final  pattern.  Since  the  control  hypcrplane  is  an  array  of  identical  cells  inter¬ 
connected  in  a  uniform  fashion,  each  cell  can  receive  state  information  only  from  its  neighbors. 
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This  process  is  illustrated  in  the  following  example. 


Each  row  describes  the  state  of  the  one-dimensional  array  at  a  particular  time.  At  t  =  0,  the 
"seed*  state  'S'  is  planted  at  an  appropriate  place  in  the  array  with  all  other  cells  in  the  'quiscent' 
state  'O'.  At  each  time  step,  each  cell  observes  the  state  of  each  of  its  immediate  neighbors  and  its 
own  state,  then  decides  what  its  next  state  will  be. 


For  example,  at  time  t  ■  0,  cell  0  observes  that  the  cell  to  its  immediate  left  is  in  the  '0'  state, 
cell  0  is  in  the  'S'  state  and  the  cell  to  its  immediate  right  is  in  the  '0'  state.  The  local  pattern  for 
cell  0  at  time  0  is  said  to  be  0S0.  When  any  cell  in  the  array  is  in  a  local  pattern  of  0S0  at  time  t, 
it  will  change  its  own  state  to  state  A  at  time  t  +  1.  Similarly,  if  a  cell  is  in  a  local  pattern  of  00S 
or  S00  at  time  t  it  will  also  change  its  own  state  to  state  A  at  time  t  +  1 .  Any  cell  in  local  pattern 
000  at  time  t  will  stay  in  local  state  0  at  time  t+  1.  Therefore,  the  result  of  each  cell  at  time  t  =  0 
observing  its  local  pattern  and  changing  to  the  appropriate  next  state  will  transform  the  global 
pattern 


OOOOOOOSOOOOOOO 
at  time  t  =  0,  into  the  global  pattern 

...0  0  0  0  0  0  AAA  0  0  0  0  0  0. .. 

at  time  t=  1.  In  like  manner,  the  global  pattern  at  t  =  I  will  be  transformed  into  the  global  pattern 

...0  0  0  0  0  B  B  C  B  B  0  0  0  0  0.  .. 

at  t  *  2  by  each  cell  applying  the  following  local  transformation 


local 
pattern 
0  0  0 
0  0  A 

0  A  A 
AAA 
A  A  0 
A  0  0 


next 

state 

0 

B 

B 

C 

B 

B 


Eventually,  the  desired  global  pattern  'COMPUTERS'  will  be  reached  and  will  remain  stable  if 
each  cell  applies  the  following  local  transformation. 
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In  general,  any  such  control  hyperplane  can  be  characterized  as  a  tessellation  automaton  [3]. 

The  tessellation  automaton  (TA)  is  a  four  tuple 

TA-(A,Ed,X,o) 

where, 

1.  A  is  a  finite  non-empty  set  called  the  state  alphabet.  For  our  previous  example,  A  = 
{0,SIA,B,C,I,E,O,M,P,U,T,R} 

2.  Ed  is  the  set  of  all  d-tuples  of  integers  called  the  tessellation  space.  Here  the  tessellation  space 
is  said  to  be 'd'  dimensional.  In  our  example,  Ed  is  simply  the  set  of  all  integers 

...  -3,  -2,  -1,  0,  1,  2,  3,  ... 

Ed  defines  the  spatial  location  of  each  cell  in  the  array. 

3.  X  is  an  n-tuple  of  distinct  d-tuples  of  integers  called  the  neighborhood  index  Each  cell  is  said 
to  have  n  neighbors  and  n  is  called  the  neighborhood  scope.  In  the  example,  n  =  3,  and  X  = 
(-1,  0,  1).  X  defines  the  relative  coordinates  of  a  cells  neighbors.  For,  example,  cell  5  has 
neighbors  (4,5,6)  obtained  by  adding  each  coordinate  of  the  neighborhood  index  to  the  cell 
location  (5). 

4.  o  is  a  mapping  from  An  into  A  called  the  local  transformation. 

Each  cell  will  decide  its  next  state  by  observing  the  present  state  of  its  neighbors.  In  the  example, 
o(BCB)  =  I,  o(MPU)  =  P,  etc.  It  is  desirable  from  the  stability  point  of  view  that  each  cell  be 
its  own  neighbor  [4|, 

For  our  work  we  can  conveniently  characterize  the  control  hyperplanc  as  a  two  dimensional 
tessellation  automaton,  since  the  computation  hyperplane  is  a  two  dimensional  array  of  switches 
and  processing  elements.  To  completely  describe  the  control  hyperplanc,  we  need  to  specify  the 
neighborhood  index  and  the  local  transformation  a.  The  only  difference  from  the  example  is  that 
the  neighborhood  index  and  the  local  transformation  will  be  two  dimensional  in  nature. 


1.4  Summary  of  Results 

This  presentation  describes  the  results  of  research  into  the  topic.  No  attempt  will  be  made  to  de¬ 
scribe  the  results  in  great  detail. 


1.4.1  Design  Example 

A  complete  design  example  that  incorporates  four  representative  architectures  has  been  developed 
1 17).  These  were  chosen  to  achieve  a  variety  of  research  architectures.  Included  were  banyan  net¬ 
works  (Gork76],  a  fault  tolerant  architecture  |l’rad82|,  the  hypertree  (Goods  1|,  and  the  lens  strategy 
(fink81|.  Ilie  results  show  that  the  complexity  of  the  design  is  practical  for  present  VLSI  tech¬ 
nology. 
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1.4.2  I/O  Algorithm 

Since  the  structure  of  the  array  changes  from  architecture  to  architecture,  and  since  the  location  of 
the  active  cells  may  change  dynamically  due  to  faults,  it  is  necessary  to  provide  a  dynamic  con¬ 
nection  from  the  I/O  ports  of  the  array  to  the  active  region.  This  dynamic  path  allocation  must 
utilize  only  good  cells.  A  complete  specification  for  an  I/O  algorithm  has  been  developed. 


1.4.3  Determination  of  Fault  Free  Spaces 

A  distributed  algorithm  has  been  developed  to  determine  the  sizes  of  maximal  regions  of  fault-free 
cells  in  the  array.  Faulty  cells  are  surrounded  by  a  quarantine  wall.  Each  cell  in  the  array  passes 
a  continually  updated  version  of  a  parameter  called  the  s-value  to  each  of  its  neighbors.  When  the 
s-value  distribution  becomes  stable,  each  cell  in  the  fault-free  region  will  know  exactly  how  many 
fault-free  cells  are  available  in  all  four  directions  from  its  location.  This  distributed  space  finding 
algorithm  plays  a  crucial  role  in  the  development  of  distributed  fault-tolerant  reconfiguration  algo¬ 
rithms  for  the  proposed  structure. 


1.4.4  Fault  Tolerant  One-Dimensional  Architecture 

A  complete  set  of  algorithms,  including  the  required  transformations,  has  been  developed  for  both 
fault-diagnosis  and  reconfiguration  in  a  one-dimensional  architecture  [  16|.  Both  single  and  multiple 
faults  are  covered.  Distributed  control  is  maintained  for  all  procedures.  Results  regarding  the  rate 
of  growth  of  patterns  and  the  number  of  time  steps  needed  for  reconfiguration  have  been  derived. 


1.4.5  Multi-Dimensional  Architecture 

A  complete  set  of  algorithms  for  fault  diagnosis  and  reconfiguration  in  multi-dimensional  architec¬ 
tures  has  been  developed.  Although  applicable  to  architectures  of  any  dimension,  two  dimensional 
examples  arc  emphasized. 


1.5  Future  Directions 

This  section  identifies  problems  that  need  to  be  solved  to  make  the  proposed  architecture  more 
efficient  and  to  increase  the  range  of  applications. 

Previous  work  has  focused  on  the  use  of  the  array  for  non-real  time  applications.  High 
availability  was  the  goal.  There  are  no  provisions  for  recovery  during  a  computation.  The  system 
can  reconfigure  but  then  the  computation  in  progress  must  be  restarted  from  the  beginning.  For 
real  time  applications,  there  must  be  some  means  for  saving  data  during  reconfiguration.  Such 
applications  require  no  gaps  in  data  collection.  A  possible  solution  is  to  utilize  a  data  checkpointing 
system,  with  queues  to  hold  data  while  reconfiguration  is  in  progress.  The  problem  of  transferring 
the  state  of  the  failed  processor  to  its  replacement  is  also  a  difficult  problem.  One  possible  solution 
is  to  utilize  shadow  processors,  so  that  there  are  always  extra  processors  running  each  task  in  lock 
step.  In  this  way,  it  might  be  possible  to  quickly  switch  from  the  faulty  processor  to  a  good  one. 
Although  these  problems  have  been  studied  in  other  environments,  applying  them  to  an  array 
processor  with  embedded  control  will  require  considerable  effort. 

Although  the  problem  for  reconfiguration  in  the  presence  of  faults  has  been  extensively  in¬ 
vestigated.  no  method  for  fault  detection  has  been  formalized.  Several  possibilities  exist.  If  a 
processor  could  execute  a  self  test,  or  if  neighboring  processors  could  test  a  processor,  then  no  ad¬ 
ditional  hardware  would  be  required.  However,  this  approach  would  require  extensive  software, 
and  would  require  considerable  time.  Faults  could  be  present  for  some  time  before  detection. 
Some  real  time  applications  could  not  tolerate  this  amount  of  fault  latency.  Another  possible  sol¬ 
ution  is  to  use  redundant  processors,  each  performing  the  same  computation  in  lock  step.  The 
outputs  could  be  compared,  data  errors  would  be  detected  with  no  delay,  and  system  rcconfigura- 
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tion  could  immediately  be  invoked.  The  advantages  of  this  approach  arc  less  software  overhead  for 
testing,  and  low  fault  latency.  The  disadvantage  is  considerable  hardware  overhead.  Since  hareware 
costs  are  going  down  while  software  costs  are  going  up,  perhaps  this  last  alternative  might  turn  out 
to  best  in  the  long  run. 

Time  for  reconfiguration  is  also  a  potential  problem  in  real  time  systems.  The  current  recon¬ 
figuration  algorithms  proceed  by  clearing  out  the  entire  array  when  a  fault  is  detected.  Rcgrowth 
of  the  desired  control  pattern  then  proceeds  with  faulty  cells  quarantined.  Although  this  process 
is  acceptable  for  non-real  time  applications  where  the  computation  can  be  restarted  from  the  be¬ 
ginning  once  the  reconfiguration  is  complete,  real  time  applications  require  fast  efficient  processing. 
For  this  reason,  partial  clearing  and  partial  reconfiguration  will  probably  be  necessary.  This  will 
require  considerable  modification  of  the  reconfiguration  algorithms. 

A  final  concern  is  efficient  use  of  available  resources.  The  present  algorithms  do  not  allow 
faulty  cells  within  the  active  processing  array.  Modification  of  the  control  algorithms  to  allow  this 
condition  would  allow  more  efficient  use  of  facilities. 
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ABSTRACT 


and  henc 
are  s 1 mp I 


In  e:i  act  computation,  a  number  ot  nr  or.- 1  en>*„  i  nt  ,  1  ••• 

solution  to  which  demands  an  exhaustive  search  and  hence  ■  nr  > 
deal  o-f  computing  time.  The  algorithms  used  are  s  1  mp  i  o  hot  tin 
computation  involved  is  so  great  that  it  cannot  be  don* 
economically  on  a  large  scale  time-shared  general  purposf 
computer.  The  NOVAC  project  at  Concordia  consists  of  ; 

dynamically  variable  tree  structured  system  for  solving  a  cl.v:- 
of  combinatorial  problems.  The  proposed  mu l ti processor  s  true tun 
consists  of  loosely  coupled  processors  with  no  shared  memory., 
each  processor  in  the  system  can  be  a  master  or  a  •?.  J  a-.'e  or  both 
This  system  is  to  be  built  with  of -f-the-shel  i  mini  and  mirr- 
computers  .  and  interconnected  using  an  inexoen«i  ve  \s%.  nchr  c----  j- 


.  n  g  a  n  i  n  < 


o  c?n  ■■  \  v  >v 


bus.  The  1  o g i  c: a  1  structure,  c: o n s i  s t i  n u 


h  i  er  ar  r  t-i 


mas  ter  s . 


each  with  a  number  of  claves  is  natur 


roii  i  '1 


problems  which  can  be  split  -up  into  a  number  of  ident  •  -it 

problems.  The  time  required  to  solve  such  a  problem  rm  i  i  i  be 
depending  on  it's  sine  and  nature  hours,  days  .  *r  weet  s .  f  h.e  imit 
costs  involved  m  having  an  absolutely  reliable  system  tor  : ■  -. ... 1 
duration  necessiat.ed  a  recovery  system  using  the  buddy  concept  . 
Herein  a  failure  of  one  of  the  processor  of  a  pair  or  bndd* 
processors  could  be  tolerated.  The  last  assigned  to  the  fail ec 
processor  could  be  picked  up  from  the  last  cheef  point  r-taf 
stored  in  an  on-line  aim  liary  a  tor  age  device, 


1.  Introduction 

There  are  many  problems  in  o,.  act  comput  at  t  on  ,  r  eiy.  -.  *  r  inn 
great  amount  of  computing  time  to  solve  them.  The  comput at i one 

involved  are  simple  but  the  amount  of  computation  is  so  great 

that  it  cannot  be  done  economically  on  a  lame  scale  genei-  >  I. 


that  it  cannot  be  done  economically  on  a  lame  scale  goner  >  I. 
purpose  computer.  A  number  of  systems  Cl,  2.1  have  been  proposed 
to  exploit  the  parallelism  in  such  problems.  .-»  c  output,  ei 

structure  in  the  form  of  a  tree  has  been  proposed  in  IT!  to  solve 
problems  that  may  be  e--  pressed  with  recur-  - 1  ,e  :.].  oori  t  h  m  r  . 
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possible  to  produce  inexpensive  micro  and  super-micro  processor.? 
with  a  -fairly  significant  amount  of  computing  power.  Distribute-': 
systems  which  evolved  as  a  result,  of  this  situation,  consisf 
multiple  computers  and  devices  interconnected ,  using  a  net war I-  t* 
form  a  coordinated  system.  Such  a  distributed  system  offer  ; 
novel  approach  to  implement  fault-tolerant  and  highly  available 
computing  [8—113.  The  NOVAC  project  [63  at  Concordia  consist  s 
of  a  loosely  coupled,  NOn— tree  structured  multiprocessor  syuten 
which  can  be  dynamically  structured  into  a  VAr  i able  tree  to  solve 
a  class  of  Combi nator i al  problems.  The  structure,  consisting  of 
a  hierarchy  of  master,  each  with  a  number  of  slaves  is  natural 
for  the  set  of  tasks  which  can  be  split— up  into  a  number  ot 
identical  sub-tasks.  In  a  multi-level  system  a.  slave  can  further 
partition  its  subtask  and  pass  the  sub-subtasks  onto  •  tn  ow 
slaves.  A  prototype  of  this  system  is  built  with  off -the-Si  iej  . 
mini  -computer  <  PDF' 11/34  )  and  mi cro— computer 3  <  LSI  / 73  1  an< 

interconnected  using  inexpensive  asynchronous  bus. 


Our  design  (  figure  1  ) proposes  a  set  of  processors 
logically  connected  as  a  "tree"  and  satisfying  the  following 
conditions. 


< i )  Each  processor  is  either  a  master ,  a  slave,  or  both 
i i )  No  memory  is  shared  between  processor s . 


< i i i )  A  master  can  instruct  its  own 
s  u  b  s  e  q  u  e  n  1 1  y  o  b  t.  a  i  n  r  e  s  u  1 1  s 
comm  u  n  1  c.  a  tion  is  p  o  s  s  i  b  1  e . 


s  to  perform  t ask • 
t  h  e  s  i  a  v  e  s  s  n. 


The  capacity  of  the  communication  channels  is  not  cr 
for  this  project.  We  anticipate  that  in  a  typical.  proto 
master  will  instruct  a  slave  by  transferr  1  ng  a  few  t  f 
bytes  of  code  and  data  to  it,  and  the  slave  will  then  run 
least  several  minutes  before  returning  a  solution.  Tl 


about  the  same, 

b  e  f  o  r  e  r  e  p  1  y  1  n  g  . 


but.  the  slave  may  execute  for 
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Reliability  is  always  an  important  con  si  dor  at.  1  on  ■  t  <  v  t,*- 

design  of  a  system  which  has  to  perform  correct.ly  in  a  cr  1 1:  .1  c.  .1 1 

environment.  Physical  devices  have  an  inherent  failure  rate  and 
as  such  components  of  computer  systems  will  fail  .  The  romp  oner,  t  ~ 
in  the  current  computer  systems  are  built  using  technologies  that 
arc-  more  reliable  than  earlier  technology.  However ,  the 
probability  of  component  and  subsystem  failure  in  a  roonle- 
system  is  never  reducible  to  ?  oro.  Sv  s  t  ems  t  or  re  a  ■  1 1  .re:- 

pr  c.  ;•  s  0-:  ,  <*u  : : .  a.j  t  r-  r  •  -!  1  1  1:  coni.  1  ot  ,  .ira  < 4  l  . •  a  = 
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avoidance  (fault-intolerance)  and  the  fault  tolerance.  In  the 
fault  avoidance  method,  the  reliability  is  increased  b /  t  he 
use  of  conservative  design  guidelines,  using  high l  /  rnl 1 aol e 
components  that  have  been  subjected  to  an  initial  burn- 1  n  phase 
etc.  However,  in  spite  of  these  techniques,  failures  will 
occur . 

In  fault  tolerance,  on  the  other  hand,  the  objective  is  r or 
the  system  to  perform  to  the  desired  specification  even  in  the 
presence  of  faults.  The  use  of  redundancy  is  the  usual  approach. 
Redundancy  in  a  computing  system  could  be  considered  to  be  in  twc 
dimensions,  namely  components  or  time.  Component  redundancy  is  in 
the  form  of  extra  components,  subsytems  ar,d  systems;  1 1  ms 
redundancy  is  ususlly  provided  by  software.  The  prone-  ed  i-.  •■■■',  - 
recovery  method  falls  in  the  area  of  fault  tolerance  •••  i  t  h  "  :  m, 
redundancy. 

Most  of  the  commercially  available  systems  L 1 1 j  are  mode 
fault  tolerant  by  providing  redundant  or  spare  processors  and/or 
redundant  communication  links.  When  an  active  system  component 
fails,  its  tasks  are  dynamically  transferred  to  spare  components, 
thereby  allowing  system  operation  to  continue.  In  most  of  the 
systems  it  has  been  assumed  that  reconfiguration  and  recovery  are 
directed  by  a  central  or  global  supervisor  ,  which  is  .offer.  the 
most  vulnerable  part  of  the  system. 


Buddy  Recovery  Concept 


In  this  section,  we  propose  a  failure  recov 
for-  a  task  oriented  distributed  computing  system , 
may  spawn  subtasks  which  execute  in  other  nodes, 
assumed  to  be  capable  of  detecting  failures  of  the 
other  nodes.  The  only  symptom  of  malfunction 

processors  is  complete  failure  or  "death".  In  or de 
a  multiprocessor  system  fault  tolerant,  it  is 
develop  strategies  which  after  the  occur  non.:.-1  . 
death  will  permit  the  remaining  processors  Vo 
agree  on  a  reconf iguration  of  the  system. We  ass 
interconnection  network  will  not  tail.  tact.  .u:ci 

tests  independently  according  to  some  i  or.  ail 
schedul e . 


proce 


One  of  the  recovery  strategies  of  ten  used  1 s  a  w i  1 
place  scheme  in  which  the  system  must  save  sufficient  inter 
to  undo  the  effect  of  an  action  it  it  tai  Is.  ran:  w.-i  d 

recovery  i.e.  resetting  an  erroneous  state  of  a  system 
previous  error  free  state,  is  an  import  ant  gene.  ■  -  '<>:• 

r  v  er  /  from  f  su  I  f  ~  in  i  -  v  store.  .  •-  •  i  !  I  .  -  .. 


-mat i on 


ecover  •/  c  an  be  complex  particul  ar  1  •  it  t  h 
vs tom  i  -  multi  level  . 


In  a  conventional  multiprocessor ,  a  task  is  re-executed 
after  re— initial  is at ion  of  the  multiprocessor  when  a  permanent 
hardware  fault  is  detected.  This  imposes  a  significant  overhead, 
since  the  computation  between  the  start  of  the  task  and  the  time 
of  the  fault  is  lost.  If  the?  task  is.  distributed  on  different 
processing  units  in  the  multiprocessor  then  a  failure  in  a  sub¬ 
task.  would  imply  restarting  the  entire  task,  resulting  in  high 
initialization  overhead. 


The  proposed  system  takes  advantage  of  the  multifile 


processing  units  to  allow  recovery  from  hardware  failures. 
Briefly,  the  detection  of  a  processor  failure  by  another 
processor ,  prompts  the  replacement  of  the  faulty  processor 
this  processor;  and  resumption  of  its  task:  from  a  non-fault  . 
state  of  computation  prior  to  the  failure. 


The  concept.  of  the  buddy  system  involves  pairing  of  two 
processors.  A  processor  F'i  ,  which  is  given  a  task  is  paired  with 
another  processor  Pj  (=Pto*  )  .  F'j  may  also  be  assigned  another 
task  and  paired  with  processor  P,((Pbj)  .  The  buddy  processor  PBi  , 
is  responsible  for  testing  the  state  of  the  processor  Pbl  .  If  F'bl 
finds  the  processor  Pt  in  a  failed  state  then  it  initiates  the 
recovery  procedure.  The  task  which  was  being  performed  by  the 
failed  processor  (Pt )  is  taken  over  by  its  buddy  processor  iPtoi » . 
Sub-tasks  initiated  by  failed  processor ,  P, ,  can  proceed 
normally  without  being  re -initialized.  Calls  for  the  downloading 
of  the  tasks  and  the  state  save  points  are  designed  such  that  the 
buddy  processor  can  recover  the  task  from  an  advanced  state  of 
comp ut at  1  on . 

The  type  of  fai  lure  which  we  are  considering  is  the  t  at  lure 
of  a  processor  while  it  is  executing  a  task.  The  basic;  idea  o  + 
the  buddy  failure  recovery  concept  is  the  following.  For  each 
task  t i  that  is  assigned  to  a  particular  processor  P,  a  budd> 
processor  P&  is  selected.  The  buddy  processor  P* ,  is  cognizant 
of  the  status  of  the  processor  P,  and  the  task  tt  ,  including  anv 
of  i  ts  sub  tasks  t,  In  addition  the  processor  F  ,  s+u-tro'*  J--h- 
st  ate  of  the  task  t,  ,  at  regular  interval  u  m  an  on  - 1  i  rv» 
auxiliary  storage  device.  This  device  is  accessible  to  the  buddy 
processor  P„.  In  the  case  of  a  failure  of  P  this  enables  i  t  >-■ 
buddy  processor  P„  to  recover  from  the  last  non  •-+  aui  t  y  state 
stored  by  the  failed  processor  rather  than  re-executing  the  f  a  ,>i 
which  was  being  executed  by  the  processor  that,  railed.  fU  go  the 
system  design  allows  normal  continuation  of  any  sub-task  ti., 
created  by  the  processor  P  before  its  fai  lure,  and  the  r  opor  tit  in 
of  its  results  are  forwarded  to  Pt. . 


Selection  of  the  buddy  processor  is  left  to  the  processors; 
e  v  or  .  r  I  ave  prot  essor  ,  on  recr-i  >.  i  ng  a  t  ■  -  t-  fu  be  yr  ■  ►  t  or  .n.  •<-!  1  r  •-•m  .  ■ 
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ARCHITECTURE  *  COMMUNICATION  REQUIREMENTS 


A  reliable  and  survivable  interconnection  system  must 
■feature  good  operational  security  at  three  levels:  the 

architectural  or  structural  level,  the  physical  implementation 
level  and  the  communication  control  level. 

In  the  proposed  system  tasks  are  executed  by  a  set  of 
physically  distributed  computing  elements.  These  computing 
elements  must  communicate  with  each  other  in  order  t.o  co-ordinate 
their  actions  so  as  to  achieve  the  global  system  objectives. 
Each  processor  has  its  own  private  main  memory,  and  on-line 
secondary  storage.  The  latter  is  not.  only  connected  to  tele¬ 
processor  but  also  to  the  global  bus  by  its  device  control  1  o-* .  r> 
processor ,  as  long  as  it  is  'alive'  has  exclusive  access  to  the 
on-line  storage.  On  the  failure  of  the  processor  the  control  1  Di¬ 
al  lows  other  processors  to  access  the  on-line  storage  in  a  react 
only  mode. 

The  global  bus  structure  (Figure  1)  allows  direct 
communication  between  processors  and  maps  the  logical 
master/slave  structure.  Connecting  the  on-line  storage  to  the 
bus  allows  the  buddy  processor  (P** )  to  read  the  last  state  saved 
by  the  processor  <Pt )  before  its  failure.  Using  this  methodology 
reduces  the  load  on  communication  network  since  it  does  not 
require  every  state  save  to  be  sent  to  the  buddy  processor . 

(i)  Di  rect  communication  is  possible  between  any  tw.,» 
processors. 

<i i >  Failure  of  any  one  or  more  of  the  processors  does  not 
interfere  in  any  way  with  the  communication  among  the 
remaining  processors. 

(iii)  In  the  case  of  a  processor  failure,  the  on-line 
auxiliary  storage  of  the  failed  processor  is  acces<- 1  h  i 
to  other  processors  for  read  purposes . 


IMPLEMENTATION 


The  user  program  is  loaded  onto  the  root  processor ,  which 
assumes  the  role  of  the  over a 1 l  master.  The  master  processor 
sub-divides  the  given  task  into  sub-tasks  and  sub -s uh - 1. as l  c 
according  to  the  users  directives  as  specified  in  the  source 
program.  The  master  processor  also  generates  the  machine  code  o+ 
each  individual  task  i.e.  *nr  each  sub -task  in  the  system.  I - 
ii  i  so  associates  a  unique-  i  ck-n  1 1  tier  with  each  cur-  i  a*  <  . 
st'-'p  is  broadcast  i  ng  o»  a  l  i.  sub  -  task  s  along  w  i  th  th-i  r  .  ;  -i- 

identifier  by  the  master  processor .  Ibis  broadcae  t  .  -  pi.  i-  ■,  i  .  ■  . 
other  pr  ’ceasors  in  the  networi  and  stored  on  their  roc peer  t  > 
or  -  L  i  ne  secondary  storage.  Ibis  one  ti  me  initial  hrrwiii.  r.  •  ■ 

storage  of  sub-tasks  erab  les  the  s  -stem  in  reduc  i  na  fiu>  i  oan  or- 


the  communication  network  during  the  execution  stage  since  i 1 
would  not  be  necessary  t or  a  processor  to  pass  the  code  o-t  rnr 
sub-task  to  a  slave  processor.  During  the  execution  period , 
whenever  a  processor  needs  another  processor  to  act  as  its  slave 
processor  and  execute  a  sub-task,  it  broadcasts  a  request  tor  -■ 
slave  over  the  bus,  and  an  idle  processor  responds  b, 
transmitting  its  id.  The  master  then  passes  onto  the  slave 
processor  the  following  information:  the  processor's  '.master* 

ID,  the  ID  of  its  buddy,  the  task  ID  and  the  associates 
parameters. 

The  slave  processor  chooses  a  buddy  processor  and  sends  the 
buddy  processor's  ID  to  the  master  processor,  which  is  al-.r 
received  by  the  master  processor's  buddy. 

As  a  result  of  the  above  specified  sequence,  the  si? vs 
processor  knows  the  identifications  of  its  buddy  processor ,  oi 
its  master  processor  and  of  the  buddy  processor  of  its  master 

processor . 

Whenever  any  processor  creates  a  slave  to  execute  a  sub¬ 
task,  it  informs  its  buddy.  This  will  keep  thee  buddy  processor 
cognizant,  of  all  the  tasks  down  loaded  by  the  processor ;  the  IT 
of  the  slave  and  the  slave's  buddy.  In  case  or  a  processor 
failure,  this  i s  needed  by  the  buddy  processor  since  it  would  hr 
receiving  the  results  from  the  tasks  being  performed  by  the  si 
processors  of  the  failed  processor. 

On  completion  of  the  task  the  slave  processor  needs  ic 
report  the  results  back  to  its  master  processor ;  if  it  does  not 
receive  an  acknowledgement  back  from  the  master  a  +  te*  a 
prespecified  time  it  sends  the  results  to  the  master  processor 
buddy . 

Recovery  in  case  of  the  failure  of  a  processor  is  initiated 
by  its  buddy  processor .  When  a  processor  detects  the  failure-?  of 
its  buddy  it  may  be  in  either  of  two  states  —  idle  or  nur  „  j  * 
it  is  i  n  an  idle  state  then  it  starts  processing  the  buddies  t  •*•?! 
otherwise  it  first  finishes  the  task  i t  is  performing  and  ther 
takes  up  the  failed  processors  task:. 


6.  Conclusion 

In  the  buddy  recovery  scheme,  at  processor  assi  gned  a  tael 
depends  on  its  buddy  processor  for-  the  completion  of  the  task  at 
hand  in  case  of  a  failure.  The  processor  common i cat  e< 

per  i  od  i  ca.l  1  y  with  the  buddy  processor  to  indicate  that  it  i  >- 
aJ  i  we  '  .  Failure  of  this  communication  initial  e'  r  <■■■■■?• 

proced'  ire  in  the  buddv  t'<  re1  i  •  i  «■*  a  the  i  act  c  or  r  i-  . . «  ••• 

t  I  '  ►—  l  afiij  'll':  t  .1  j-i|- ,  !('  i.  |  l,  .'nil  I:  1'1  . .  V  ' 

r  oc  omput  at i on  o+  the  task  < and  its  sub -task  s*  are  not*  r  enu i * 

I  he  scheme  is  lies  l  tine*- 1  for  a  I  ai  1  uro  in  the  iMc.cer  •  i n  ' 
and  not  i  n  the  communication  faci  I  i  1 1  es  or  the  second  c  .  s  t  or  -aq- 


device.  Schemes  used  in  commercial  fault  tolerant  systems  :< 
dual— bus,  dual  port  device  controller  and  dual -portr-ri  -je-.  i  <-t?K 
are  too  expensive  for  a  computation  i  n t e n s  i  v s  e n  --  ir  o n m e n t 
requiring  hundreds  of  processors. 

The  implementation  of  the  recovery  scheme  is  in  progress  on 
a  prototype. 
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1.  Introduction 

The  Clouds  project  at  Georgia  Tech  was  initiated  to  conduct  research  into  failure  resistant,  efficient  distri¬ 
buted  architectures  and  operating  systems.  The  project  used  stateof  the  art  techniques  to  design  a  distributed 
operating  system  kernel  that  can  be  supported  on  conventional,  unreliable  hardware,  and  be  more  reliable  than 
the  underlying  electronics.  Several  approaches  to  the  problem  were  considered,  and  after  substantial  research 
and  construction  effort,  the  current  design  emerged. 

This  paper  presents  an  overview  of  the  Clouds  operating  system.  Clouds  is  a  distributed  operating  system 
providing  support  for  fault  tolerance,  location  Independence,  reconfiguration  and  transactions.  The  implementation 
paradigm  usee  objects  and  nested  actions  as  building  blocks.  The  paper  also  discusses  subsystems  and  appli¬ 
cations  that  can  be  supported  by  Clouds  to  further  enhance  the  performance  and  utility  of  the  system. 

2.  The  Clouds  Project 

In  recent  years  there  has  been  considerable  interest  in  developing  distributed  computing  systems.  Distribu¬ 
tion  of  computing  resources  suggests  many  possible  benefits  including  greater  flexibility,  enhanced  computing 
power  through  greater  parallelism  and  application  of  more  resources,  and  increased  reliability.  But  the  promise 
of  distributed  systems  have  been  tarnished  by  a  large  set  of  problems  that  accompany  such  systems.  The  Clouds 
project  was  initiated  with  a  determination  to  overcome  these  problems  through  clear,  simple,  elegant  design,  that 
would  lead  to  a  integrated,  useful  distributed  system. 

2.1.  Toward  Object  Oriented  Systems  Design 

The  approach  currently  advocated  by  the  research  community  is  an  object  based  architecture  (see  section 
2.3  and  2.4).  Every  system  (and  user)  function  on  an  object  based  system  is  encapsulated  in  objects.  The  data 
these  operations  access  are  also  encapsulated  in  the  same  object  as  the  operation.  These  objects  form  units  of 
synchronized) le  shared  data  structures  and  units  of  recoverable  data  in  case  of  crashes. 

The  conceptual  mechanism  to  access  objects  are  procedure  calls.  However,  most  conventional  systems  do 
not  have  the  support  to  allow  remote  procedure  call  (RPC)  mechanisms  needed  to  support  object  invocations. 
Thus  a  solution  to  implementing  objects  on  conventional  operating  systems  is  to  use  the  active  object  paradigm. 
In  this  technique,  a  process  is  associated  with  every  object,  and  receives  requests  (via  messages)  from  other 
process  wishing  to  access  the  object.  The  object  handling  process  is  complex,  with  abilities  of  interleaved  execu¬ 
tion,  non-blocking  receives  and  some  parallelism  (as  it  has  to  handle  several  clients  at  the  same  time). 

This  is  the  approach  adopted  in  the  the  Eden  system  at  Washington  [Alme83],  the  CRONUS  system  at 
BBN  [Schn85]  Argus  at  M.l.T.  [Lisk83a]  and  several  other  systems.  Although  it  is  better  at  failure  tolerance  than  a 
message  based  system,  this  approach  still  has  its  pitfalls,  such  as  the  way  the  object  manager  has  to  be  imple¬ 
mented.  It  also  uses  a  large  number  of  processes  (one  per  object),  and  has  complex  process  handling  code. 

Support  for  distributed  object  based  systems  is  built  into  the  Argus  language  that  is  supported  on  the 
Argus  system.  Argus  uses  guardians  and  objects  to  create  conceptual  nodes.  However,  most  of  the  object  calls 
in  Argus  are  handled  through  messages  and  slaves  [WeLi83]. 


Clouds  uses  the  object  based  approach  with  passive  objects.  The  invoking  process  carries  its  thread  of 
execution  into  the  object.  Further,  Clouds  does  not  stop  at  objects  being  the  sole  units  of  recovery.  Object  are 
units  of  recoverable  data,  actions  are  units  of  recoverable  active  components.  The  Clouds  paradigm  of  using 
actions  and  objects  for  a  reliable  distributed  system  has  a  potential  for  large  payoffs  due  to  its  simplicity,  ease  of 
implementation  and  efficiency. 

2.2.  Th«  C/oucte  Architecture 

The  architecture  of  a  distributed  system  can  be  partitioned  into  two  main  areas:  The  hardware  configuration 
and  the  operating  system  structure.  Clouds  is  designed  to  run  on  a  set  of  processors,  loosely  coupled  over  a 
medium-to-high-speed  network.  The  prototype  configuration  is  shown  in  Fig.  1. 

The  prototype  consists  of  three  VAX/750  computers  with  3  Meg  memory  connected  through  a  high  speed 
Clustar  Interconnect  (Cl).  The  front  end  network  consists  of  an  Ethernet.  This  is  the  Clouds  gateway  to  the 
external  world.  (Currently,  both  the  back  end  network  and  the  front  end  network  are  on  the  same  Ethernet.) 
Users  access  the  Clouds  system  through  desktop  computers  (IBM-PC/XT  computers  in  our  prototype)  through 
the  Ethernet.  The  disk  drives  used  for  secondary  permanent  storage  are  dual  ported.  This  allows  reassignment 
of  disk  drives  in  case  of  processor  failure  and  helps  in  reconfiguration.  The  front  end  Ethernet  also  serves  as  a 
vehicle  for  easy  transfer  of  user-processor  assignments.  In  case  of  failures,  users  can  be  virtually  transparently 
floated  to  another  serviceable  processor. 

13.  Objects 

The  operating  system  structure  of  Clouds  is  based  on  the  action/object  paradigm.  All  permanent  system 
components  in  Clouds  are  objects.  Objects  form  a  clean  conceptual  encapsulation  of  data  and  programs.  They 
are  useful  in  providing  synchronization  and  recovery  as  will  be  described  later.  The  objects  are  accessed  by 
processes  on  behalf  of  actions. 

In  a  simplistic  view,  an  object  is  an  instance  of  an  abstract  data  type  (cosmetically  similar  to  modules  in 
Modula-2  or  classes  in  Simula).  The  object  encapsulates  permanent  data  and  a  set  of  routines  that  can  access 
(read  or  update)  the  data.  The  only  access  path  to  the  data  contained  in  the  object  is  through  the  routines  (or 
operations)  defined  in  the  object.  To  the  external  world,  the  object  is  thus  an  entity  providing  a  set  of  entry  points. 
(Jo79,  Wulf74] 

Clouds  object  are  more  powerful  than  just  a  black  box  containing  some  procedures  and  static  data.  It  also 
contains  a  stack  (temporary  data),  heap  (permanent  dynamically  allocated  data),  and  powerful  support  for  con¬ 
currency  control  and  recovery.  An  outline  of  a  Clouds  object  is  shown  in  Fig.  2. 

2.4.  Actions 

Actions  are  partially  ordered  sequences  of  operations  on  objects  that  transform  the  state  of  the  objects  from 
one  consistent  state  to  another  Actions  are  used  to  specify  units  of  synchronization  and  recovery. 

A  user  on  the  Clouds  system  can  start  up  a  transaction.  A  user  transaction  is  a  top-level  action.  A  top  level 
action  is  an  atomic  unit  of  work,  that  either  effectively  terminates  and  causes  permanent  updates  to  recoverable 
objects,  or  does  not  leave  a  trace. 

A  top  level  action  can  spawn  more  actions  or  subactions.  These  are  nested  actions.  They  can  run  con¬ 
currently  with  other  subactions  and  the  top  level  action,  or  they  can  be  spawned  sequentially.  The  subactions 
inherit  the  locks,  and  the  views  of  the  parent  action,  but  execute  independently.  The  subactions  may  abort  or 
commit  (conditional  commit),  and  the  termination  status  is  returned  to  the  parent.  A  top  level  action  may  decide  to 

t  This  research  Is  funded  In  part  by  NASA  grant  NAG-1-430  and  by  NSF  grant  DCR-8316590. 
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abort  or  to  commit  after  afl  tha  nasted  actions  have  terminated. 

* 

14.  tywchronliatlon  and  Hseswry 

Cloud s  uses  locks  as  a  basic  mechanism  for  synchronization  amongst  actions.  (The  locking  mechanisms 
are  not  ingrained  into  the  Clouds  design;  some  other  forms  of  synchronization  such  as  timestamps  can  easily  be 
be  substituted.)  The  object  designer  can  classify  an  object  as  synchronized  or  non-synchronlzed .  Synchronized 
objects  are  automatically  synchronized  using  the  2-phase  locking  paradigm.  The  synchronization  in  non- 
synchronized  objects  is  left  to  the  programmer.  Synchronization  primitives  such  as  locks  and  semaphores  are 
provided  for  do-tt-yourseifers.  This  allows  semantic  based  synchronizations. 

When  an  object  is  defined  as  synchronized,  the  object  compiler  includes  default  code  in  the  object  entry 
and  exit  points  to  adhere  to  the  2-phase  locking  protocol.  Each  operation  Is  classified  as  read  or  update  opera¬ 
tion  depending  upon  the  semantics  of  the  operation.  Invoking  a  read  operation  causes  the  invoking  action  to 
acquire  a  read  lock  on  the  object  (if  possible,  or  the  process  waits  until  the  lock  is  obtained.)  Similarly  invoking  a 
writs  operation  causes  the  acquisition  of  a  write  lock,  or  the  upgrade  of  a  pre-existing  read  lock  to  a  write  lock. 

Locks  are  not  released  when  the  operation  returns  or  terminates,  but  are  released  by  the  commit  phase, 
which  Is  also  handled  by  the  object  as  described  below. 

Objects  can  also  be  defined  to  be  recoverable  or  no  n-recov  arable.  That  is  any  modification  to  the  object 
is  not  made  permanent  until  the  action  that  caused  the  modification  successfully  terminates.  Recoverable  objects 
have  default  operations  “Commit”  and  “Abort”  defined  in  them.  One  of  these  operations  is  invoked  automatically 
depending  upon  the  success  or  the  failure  of  the  action.  Commit  causes  the  updates  to  be  made  permanent  and 
abort  obliterates  changes. 

2.1.  TIm  Salient  Features  of  Clouds 

Distributed  system  architecture  has  been  a  research  topic  for  some  time  now.  The  basic  choice  for  a  distri¬ 
buted  environment  gaining  popularity  is  the  object  based  approach.  Coupled  with  actions,  objects  are  a  powerful 
and  yet  simple  concept,  that  is  well  suited  to  handle  distributed  computing  environments.  However,  not  many  pro¬ 
totypes  have  emerged,  although  some  are  under  construction.  Some  of  the  notable  ones  are  the  TABS  project  at 
CMU  [Spec84],  Argus  at  MIT  (Weih83,  Lisk83a,  Lisk83b],  Zeus  at  Austin  and  Bloomington  [Brwn83],  CRONUS 
at  BBN  [Alme83],  Archons  at  C.M.U.  [Jens82]  and  Eden  at  Washington  [Alme83].  The  current  state  of  the  art  in 
constructing  distributed  operating  systems  and  environments  Is  definitely  in  this  direction. 

What  all  these  projects  have  in  common  with  Clouds  Is  the  usage  of  objects  and  actions.  The  object  is  a 
powerful  construct  that  can  effectively  used  to  handle  some  of  the  problems  of  distributed  environments.  An 
object  not  only  encapsulates  data,  but  also  the  operations  that  can  be  invoked  on  the  data.  The  data  contained  in 
the  object  can  only  be  accessed  by  the  operations  or  functions  defined  in  the  object.  Thus  the  object  can  control 
access  to  the  data  in  a  fashion  the  programmer  designed  it  to.  Note  that  even  though  the  object  itself  may  be 
passive,  the  object  can  control  access  and  enforce  synchronization  though  semaphores  or  equivalent  mechan¬ 
isms. 

Actions  are  used  to  preserve  consistency  and  avoid  partial  executions.  Actions  also  take  care  of  the  failure 
recovery  problem  inherently.  Since  a  failed  action  never  leaves  a  mark  on  the  system,  and  an  action  can  success¬ 
fully  commit  only  if  ail  its  components  executed  successfully,  the  state  of  the  permanent  storage  is  always  con¬ 
sistent. 

Clouds  is  different  from  the  mentioned  object  based  systems  in  several  respects.  First  it  implements  passive 
objects.  In  the  pure  object  based  system  the  object  is  In  fact  a  passive  segment  of  code  and  data,  that  has  no 
thread  of  execution  in  it  unless  Invoked  by  a  process.  The  object  can  support  as  many  threads  of  execution  inside 
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it  as  there  are  concurrently  invocating  processes  (if  the  synchronizations  rules  allow.) 

The  basic  primitive  supported  by  an  object  based  system  is  object  invocation.  This  is  the  most  frequent 
activity  done  by  the  system.  Some  systems  overlay  the  object  invocation  mechanisms  on  top  of  a  conventional 
operating  system.  In  this  case  each  object  invocation  has  to  trickle  down  though  several  layers  of  software  in 
order  to  work.  The  Clouds  approach  is  to  build  object  invocation  as  a  primitive  supported  at  the  lowest  level  of 
the  operating  system,  namely  the  subkernel. 

The  action  management  is  also  supported  at  a  low  level  In  the  kernel.  Thus  the  basic  function  of  the  Clouds 
kernel  are  object  invocation,  action  handling,  synchronization,  recovery  and  commit. 

In  Clouds  both  hardware  and  software  failures  are  modeled  by  the  same  paradigm.  All  failures  lead  to 
basically  one  or  more  aborted  actions.  The  fault  tolerance  is  achieved  by  reconfiguring  the  system  and  restaring 
failed  actions  after  the  failure  is  detected. 

Reconfiguration  capabilities  are  another  Clouds  novelty.  This  comes  in  two  flavors.  Upward  reconfiguration 
takes  place  when  hardware  entities  (processors,  disks,  network  connections)  are  added  to  the  system.  Downward 
reconfiguration  is  the  opposite.  Downward  reconfiguration  can  be  automatic  in  case  of  detected  failures  or  forced 
In  cases  of  premeditated  shutdowns.  The  ability  of  the  system  to  reconfigure  itself  gives  rise  to  two  significant 
advantages.  In  case  of  crashes  it  allows  normal  system  functions  to  continue.  Also  system  components  can  be 
selectively  disconnected  from  the  live  system  for  hardware  maintenance  or  software  maintenance  (replacement  of 
kernel  level  code)  without  affecting  any  activity  on  the  system. 

3.  Subsystems  and  Applications 

Clouds  provides  a  generalized  support  layer  for  many  applications  and  subsystems  that  effectively  utilize 
the  power  of  the  object  based,  action  oriented  environment.  In  this  section  we  briefly  discuss  the  implementation 
of  a  probe-based  monitoring  subsystem  that  enhances  the  fault  tolerance  of  Clouds  and  a  relational  database  that 
utilizes  Clouds  support  for  a  reliable  distributed  data  management  system. 

3.1.  Probo  based  System  Monitoring 

The  key  to  improved  fault  tolerance  lies  in  the  implementation  of  a  mechanism  for  the  system  to  monitor 
itself.  The  monitoring  can  be  at  several  levels,  discussed  later,  but  the  basic  components  of  the  monitoring  system 
are  probes. 

Probes  in  Clouds  are  a  form  of  emergency  status  enquiries,  that  can  be  sent  from  a  process  to  an  object  or 
to  another  process.  When  a  probe  is  sent  to  an  object,  the  probe  causes  the  invocation  of  a  probe-procedure 
defined  by  default  in  the  object.  The  probe  procedure  returns  to  the  caller  a  status  report  of  the  object.  This 
includes  the  status  of  the  synchronization  mechanisms,  the  actions  currently  executing  in  the  object  and  other 
relevant  information  (Fig.  3).  Probes  can  also  be  sent  to  processes  or  actions.  A  process  does  not  have  to  expli¬ 
citly  receive  a  probe.  The  probe  causes  a  process  thread  of  control  to  jump  to  the  probe  handler,  irrespective  of 
the  current  status  of  the  process. 

The  probe  handler/procedures  are  scheduled  and  executed  at  higher  priorities  than  the  regular  process 
scheduling  priorities.  Since  the  probes  cause  an  immediate,  asynchronous  reply,  and  the  probes  are  not 
suspended  by  synchronization  mechanisms,  the  time  taken  by  the  probe  to  return  to  the  sender  is  not  dependent 
on  unpredictable  conditions,  or  heavy  processing  loads.  Thus  timeouts  can  be  quite  effectively  used  for  receiving 
replies  from  probes.  Thus  using  this  scheme  we  can  detect  the  difference  between  a  dead  machine  and  a 
slow  one,  with  a  high  degree  of  accuracy.  A  primary/backup  paradigm  using  probes  can  be  implemented  to  form 
the  basis  for  implementing  fault  tolerant  actions. 
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3J.  System  Health  Monitoring  using  Probos 

Ths  probe  system  can  be  effectively  used  to  implement  a  primary/backup  paradigm  that  achieve  fault 
tolerant  actions  and  a  system  monitoring  subsystem  that  keeps  a  health  status  of  the  system.  The  monitoring  sys¬ 
tem  integrates  with  the  reconfiguration  system  in  a  manner  that  enhances  the  failure  tolerance  of  the  system. 
Details  are  omitted  for  the  sake  of  brevity.  The  Interested  reader  is  referred  to  [Da86].  The  monitoring  system  is 
depicted  in  Fig.  4. 

3.3.  An  Objoot  Basod  Database  System 

One  of  the  notable  differences  in  structure  between  conventional  database  systems  and  a  system  supported 
by  Clouds  is  the  storage  mechanism.  Instead  of  files,  we  have  a  more  powerful  construct  namely  objects.  In  the 
following  sections  we  describe  how  to  implement  a  database  system,  using  the  object  paradigm.  Subsequently 
we  discuss  approaches  to  implement  concurrency  control  and  transaction  commit  for  the  database  objects  and 
transactions  under  the  Clouds  environment.  We  also  provide  insights  into  the  effective  management  of  the  distri¬ 
buted  database  and  how  to  provide  support  for  data  replication  ( Clouds  does  not  support  replication). 

Virtually  any  kind  of  database  system  can  be  supported  in  the  object  based  architecture.  However  to  avoid 
getting  into  all  the  design  approaches  for  various  data  modeling  paradigms,  we  choose  to  discuss  the  most 
popular  database  model,  ths  relational  database  modal.  The  approaches  for  implementing  other  models  would 
be  different,  but  can  be  derived  from  the  basic  Ideas  in  our  discussions. 

3.3.1.  Objects  and  Relatione 

The  basic  building  blocks  in  a  relational  database  are  relations  and  the  relational  operators  that  access  the 
relations.  At  a  slightly  lower  level  are  the  access  mechanisms  used  for  fast  access  to  individual  or  groups  of 
tuples  in  the  relational  tables  using  key  searching,  indexing  or  hashing  techniques. 

The  straightforward  way  to  implement  an  object-based  relational  database  system  is  to  use  a  relation  per 
object  scheme.  An  object  holds  all  the  data  of  the  relation  and  contains  the  access  mechanisms  to  access  the 
data.  Thus  the  object  defines  operators  that  do  key  lookups,  projections,  tuple  insertions,  tuple  deletions  range 
queries  and  other  such  operations  on  the  objects.  A  good  feature  of  this  approach  is  that  the  object  can  be 
encapsulated  and  be  independent  of  any  systemwide  definition  of  structure  or  storage  mechanisms.  The  internal 
structure  of  the  object,  that  is  the  data  organization  (binary  tree,  B-tree,  table  unsorted),  is  not  visible  to  the  data¬ 
base  system  from  outside  and  thus  different  relations  can  be  organized  in  different  ways  and  yet  look  functionally 
identical.  The  organization  of  each  object  could  be  tailored  to  the  method  that  suites  the  data  contained  and  the 
size  of  the  object.  Organization  of  an  object  using  this  scheme  is  shown  in  Fig.  5. 

There  are  some  disadvantages  to  organizing  the  relations  in  this  manner,  we  have  designs  that  solve  most 
of  the  problems.  Concurrency  control  and  recovery  methods  integrate  well  with  the  automatic  support  provided 
by  Clouds.  Our  approach  yielded  some  surprisingly  clean  interfaces  to  between  various  types  of  objects  and  the 
database  manipulation  routines.  Most  of  the  details  of  the  design  and  structuring  of  the  database  system  is  omit¬ 
ted  for  brevity. 

Using  the  object  based  approach  we  can  determine  locking  granularities  local  to  the  object  and  use  frag¬ 
mentation  schemes  if  necessary  in  a  manner  transparent  to  the  relational  access  routines  and  database  applica¬ 
tion  programs.  Also  replication  can  be  effectively  handled.  An  overview  of  a  scheme  that  achieves  good  locking 
granularities  through  fragmentation  is  shown  in  Fig.  6.  For  further  information  about  the  techniques,  the  reader  is 
referred  to  [DaMo86] 


3.4.  Potential  Research  Issuos 

Most  of  the  potential  research  areas  that  will  lead  to  successful  designs  of  distributed  operating  systems 
have  been  discussed  above,  along  with  suggestions  of  some  solutions  involving  Clouds  philosophies.  To 
reiterate,  the  important  aspects  that  need  further  research  and  could  benefit  from  innovative  techniques  are  the 
following. 

1]  Development  of  an  innovative  architecture  for  a  distributed  system  using  the  Clouds  approach.  A  hierarchy 
of  clusters  seem  to  be  the  best  approach.  Each  cluster  should  be  a  distributed  system  which  is  object 
based  and  supports  atomic  actions.  Intracluster  machine  Independence  and  fault  tolerance  is  necessary. 

2]  In  case  of  detected  faults  or  emergency  faults,  corrective  measures  include  recovery  and  reconfiguration. 
Recovery  (or  short  term  reconfiguration)  ensures  that  vital  functions  of  the  systems  keep  ticking,  no  matter 
what  happened.  Also  long  term  reconfiguration  techniques  would  help  in  fault  isolation,  repair  and  recom¬ 
mission  of  faulty  components. 

3]  Redundancy  or  replication  is  a  time  honored  method  of  achieving  reliability.  In  some  systems,  reliability  is  of 
high  concern  and  this  traditional  scheme,  coupled  with  state  of  the  art  techniques  will  give  rise  to  the  best 
reilability/performance  figures.  How  to  Integrate  the  two  is  not  a  trivial  problem,  and  need  investigation. 

4]  The  primary/backup  paradigm  is  a  powerful  technique  for  fault  tolerance.  This  needs  extension  to  handle 
more  situations  than  just  one  failure  at  a  time.  Also  the  probes  needed  to  implement  this  scheme  can  be 
used  for  a  variety  of  other  purposes.  There  seems  to  be  some  good  research  directions  here  that  need  to 
be  explored. 

5]  The  Clouds  protection  mechanisms  will  be  used  to  support  multilevel  security.  The  prototype  does  not 
currently  support  broadcast  medium  encryption  and  digital  signatures.  These  schemes  will  be  designed, 
built  and  tested,  along  with  testing  of  their  effect  of  general  system  performance.  The  tradeoff  issues  have 
to  be  identified,  and  the  critical  messages  that  need  digital  signatures  have  to  be  formally  defined. 

This  scheme  can  then  be  extended  to  public  broadcast  channels  using  data  encryption  and  digital  signa¬ 
tures  for  guards  against  unauthorized  eavesdropping,  message  forging  and  penetration  of  the  systems. 

6]  Implementation  of  applications  subsystems,  like  the  database  system  outlined  above  that  effectively  uses 
Clouds  support  for  enhancing  system  utility. 
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Introduction 

This  paper  provides  an  overview  of  the  Loral  LDF  100  (tm)  dataflow  parallel  pro¬ 
cessor.  In  this  section  we  will  describe  the  major  design  goals  of  the  LDF  100.  In  sec¬ 
tion  1  we  will  discuss  the  LDF  100  dataflow  programming  model  and  in  section  2  we  will 
discuss  the  LDF  100  hardware  architecture. 

Biographers  usually  cover  the  childhood  of  their  subject  since  an  understanding  of 
the  person’s  beginnings  are  important  in  understanding  what  they  became.  Just  as  it  is 
important  to  understand  the  childhood  of  a  historical  figure,  the  examination  of  a  com¬ 
puter  architecture  requires  an  understanding  of  what  the  designers  of  a  computer  system 
attempted  to  achieve. 

The  LDF  100  was  designed  for  applications  in  high  speed  real  time  processing  (e.g., 
telemetry  and  signal  processing)  and  numeric  simulation.  Listed  below  are  some  of  the 
major  design  goals  of  ihe  LDF  100: 


Software 


The  parallel  processor  must  use  a  large  grain  dataflow  programming  model. 

It  must  be  easy  to  program  the  parallel  processor.  This  includes  a  good 
software  development  environment  (i.e.,  a  UNIX  development  environment). 

It  must  be  possible  to  use  standard  "off  the  shelf’  compilers  for  standard 
languages  like  C  and  Fortran. 

Hardware 

The  system  must  use  'bff  the  shelf'  hardware  (i.e.,  no  custom  VLSI) 

The  parallel  processor  must  provide  a  clean  interface  to  the  existing  Loral 
Instrumentation  telemetry  system  (the  ADS  100). 

The  parallel  processor  must  be  expandable  from  a  few  processors  to  more  than 
one  hundred  processors. 

Although  these  design  goals  are  divided  into  software  and  hardware  categories, 
these  categories  are  not  necessarily  exclusive.  The  construction  of  an  incrementally 
expandable  parallel  processor  effects  both  hardware  and  software.  The  hardware  must 
provide  support  for  the  incremental  addition  of  processors  and  the  software  must  allow  a 
program  to  run  efficiently  on  a  large  processors  arrays. 
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1.  The  LDF  100  Software 
The  Dataflow  Programming  Model 

Most  parallel  processors  can  be  divided  into  one  of  two  categories:  those  systems 
that  use  the  dataflow  programming  model  and  those  systems  that  are  based  on  the 
Communicating  Sequential  Process  (CSP)  model  of  computation. 

The  dataflow  model  views  a  program  as  a  dataflow  graph.  The  nodes  of  the  graph 
consist  of  the  parts  of  the  program  the  can  be  executed  in  parallel.  These  nodes  are  con¬ 
nected  by  arcs.  Data  is  communicated  between  the  graph  nodes  on  these  arcs.  A  node 
in  the  dataflow  graph  can  execute  as  soon  as  it  receives  the  necessary  data. 

In  both  a  formal  and  a  practical  sense  dataflow  is  a  subset  of  the  CSP  model.  A 
network  of  communicating  processes  can  be  structured  like  an  asynchronous  dataflow 
graph.  Despite  the  overlap  between  the  two  models,  CSP  based  systems  are  usually  not 
structured  like  dataflow  systems.  The  CSP  model  was  originally  developed  to  describe 
processes  in  uni-processor  operating  systems  and  most  CSP  based  systems  use  a  von 
Neumann  programming  model. 

Compared  with  classically  structured  CSP  based  systems,  the  dataflow  program¬ 
ming  model  has  a  number  of  advantages.  In  a  dataflow  program  control  is  contained  in 
the  dataflow  graph.  This  graph  is  distributed  throughout  the  parallel  processor,  so  con¬ 
trol  is  also  distributed.  Dataflow  graphs  are  asynchronous,  so  pipelining  is  easily  sup¬ 
ported.  In  contrast,  CSP  based  systems  are  synchronous,  which  makes  pipelining  diffi¬ 
cult. 

Since  the  dataflow  programming  model  is  a  "non-von  Neumann"  programming 
model,  memory  can  be  distributed  throughout  the  parallel  processor.  The  nodes  of  a 
dataflow  graph  process  the  data  that  they  receive  on  their  input  arcs.  In  most  dataflow 
systems  this  data  arrives  via  a  communication  network  or  bus. 

Small  Grain  vs.  Large  Grain  Dataflow 

Most  University  research  has  concentrated  on  small  grain  dataflow.  In  small  grain 
dataflow,  the  nodes  of  the  dataflow  graph  consist  of  the  atomic  instructions  of  the 
dataflow  machine.  The  LDF  100  uses  large  grain  dataflow.  In  this  section  we  will  dis¬ 
cuss  why  the  large  grain  model  was  chosen  for  the  LDF  100. 

Small  Grain  Dataflow 

Although  small  grain  dataflow  holds  out  the  promise  of  processing  power  that 
exceeds  the  fastest  super-computers  available  today,  no  sizable  small  grain  dataflow  com¬ 
puter  has  ever  been  built.  The  problems  that  must  be  dealt  with  in  constructing  a  small 
grain  dataflow  computer  include: 

1.  The  massive  parallelism  present  in  small  grain  dataflow  demands  a  very  high 
bandwidth  communication  media  for  the  parallel  processing  elements.  All  of  the 
practical  small  grain  dataflow  machines  proposed  so  far  use  communication  net¬ 
works  to  support  high  bandwidth  interprocessor  communication.  Networks  are 
expensive  to  implement  since  they  require  custom  VLSI  components. 

2.  Small  grain  dataflow  systems  must  be  programmed  in  dataflow'  languages  like  Val. 
ID  and  Lucid.  Although  these  are  elegant  languages,  most  prospective  customers 
for  a  parallel  processor  prefer  Fortran. 

3.  In  many  applications  a  number  of  processors  will  process  a  single  large  data  struc¬ 
ture.  The  organization  of  structure  memory  for  the  storage  of  large  data  structures 
(e.g.,  arrays)  in  dataflow  systems  is  still  a  topic  of  research. 


132 


sivtf 


v>. 


Large  Grain  Dataflow 


In  large  grain  dataflow,  the  dataflow  graph  nodes  are  implemented  by  blocks  of 
code  approximately  the  size  of  procedures.  The  larger  granularity  means  that  the  com¬ 
munication  bandwidth  that  must  be  supported  is  considerably  less  than  that  of  a  small 
grain  dataflow  system.  Since  the  communication  bandwidth  is  smaller,  a  segmented  bus 
architecture  can  be  used  on  the  LDF  100. 

A  program  for  the  LDF  100  has  two  components:  a  data  graph  description  and  a 
set  of  node  implementations.  The  data  graph  is  described  in  a  custom  language 
developed  by  Loral.  The  nodes  of  the  graph  can  be  implemented  in  standard  C  or  FOR¬ 
TRAN.  Although  an  existing  application  must  be  restructured,  existing  code  can  be 
used  to  implement  the  graph  nodes. 

The  Loral  Data  Graph  Language 

A  dataflow  graph  consists  of  a  collection  of  interconnected  nodes.  The  Loral  Data 
Graph  language  describes  the  graph  interconnections.  It  also  gives  a  code  file  implemen¬ 
tation  for  each  node  and  provides  some  simple  type  information  for  the  arcs  of  the 
graph. 

Lymph  =  'bydro.code" 

import{  w»rd(  4  J  volume,  diameter  } 

export{  w°rd[  2  ]  pressure  } 

Monitor  =  ’lalarm" 
import!  w«rd[  2  ]  pressure  } 
export{  word  signal  } 

A  graph  language  node  is  composed  of  four  parts: 

1 .  A  node  header 

2.  An  import  section 

3.  An  export  section 

4.  An  optional  firing  rule 

The  node  header  contains  the  node  name  and  the  name  of  code  file  that  implements 
the  node.  This  name  is  a  UNIX  file  name  and  is  contained  in  double  quotes. 

In  dataflow  all  nodes  must  have  an  import  section  or  they  will  never  be  scheduled 
for  execution.  Nodes  do  not  necessarily  have  to  have  an  export  sections  (for  example,  a 
node  that  feeds  an  output  device  would  have  an  invisible  set  of  exports). 

The  import  and  export  sections  can  contain  multiple  groups  of  arcs.  Each  group  of 
arcs  starts  with  a  type,  followed  by  the  name  of  the  arcs.  The  groups  are  terminated  by 
a  semicolon  or  the  brace  that  ends  the  section. 

The  firing  rule  section  is  optional  and  is  used  only  for  dataflow  nodes  that  imple¬ 
ment  control  structures  like  IF-THEN-ELSE  and  loops. 

Processors  in  the  LDF  100  are  divided  between  chassis.  A  chassis  -  processor  com¬ 
bination  uniquely  defines  a  processor  in  the  LDF  100.  A  program  referred  to  as  the 
Graph  Balancer  assigns  a  processor  and  a  chassis  to  each  node  in  the  dataflow  graph. 


The  graph  from  the  example  above,  is  shown  here,  after  it  has  been  processed  by  the 
Graph  Balancer. 


Lymph  =  ’hydro.code"  <  C0>  <  P2> 
import{  word]  4  j  volume,  diameter  } 
export{  word[  2  )  pressure  } 

Monitor  =  '^larm"  <  C0>  <  P7> 
import{  word}  2  ]  pressure  } 
export{  word  signal  } 


Like  most  dataflow  systems  the  LDF  100  uses  a  tagged  token  architecture.  The 
tokens  that  are  broadcast  on  the  dataflow  bus  are  32  bits  wide.  The  token  is  divided 
into  a  16  bit  data  field  and  a  16  bit  tag  field.  A  program  referred  to  as  the  Tag  Assigner 
assigns  tags  to  the  arcs  of  the  dataflow  graph  and  to  the  code  files  that  implement  the 
nodes.  The  code  files  must  be  tagged  because  they  are  down  loaded  into  the  dataflow 
engine  across  the  dataflow  bus.  The  graph  above  is  shown  below,  after  it  has  been  pro¬ 
cessed  by  the  tag  assigner. 


Lymph  =  'hydro. code"  <  C0>  <  P2>  <  T64479> 

import {  word]  4  )  volume  <  T64383>  ,  diameter  <  T64382>  } 

export{  word[  2  ]  pressure  <  T64381>  } 

Monitor  =  'lalarm"  <  CO;  <  P7>  •  T64399 
import{  word[  2  ]  pressure  <  T64381  } 

export{  w«rd  signal  <  T64380>  } 


Data  Graph  Nodes 

The  Loral  dataflow  programming  environment  is  a  highly  modular  one.  Each  node 
interacts  with  other  nodes  only  through  input  and  output  arcs.  This  modular  design 
allows  code  to  be  easily  reused  in  new  graphs.  Nodes  written  in  different  languages  can 
be  intermixed  without  difficulty  since  all  communication  between  nodes  takes  place  via 
tokens  broadcast  on  the  dataflow  bus. 

In  the  dataflow  environment  a  node  will  not  be  scheduled  for  execution  until  it  has 
all  the  data  it  needs  to  execute.  A  node  reads  the  data  that  arrived  on  its  input  arcs 
with  a  floread  system  call.  Unlike  reads  in  a  CSP  based  system,  fioread  is  not  a  waiting 
read,  since  the  data  is  already  present  when  the  floread  is  executed  (otherwise  the  node 
would  not  have  been  scheduled).  The  C  version  of  the  floread  call  is  shown  below. 

num  tokens  =  fioread(  ARC,  &ptr  ); 


The  floread  call  returns  the  number  of  16  bit  data  values  read.  It  is  passed  the  arc 
to  read  and  a  pointer  to  a  pointer.  The  dataflow  kernel  that  runs  on  the  local  dataflow 
processing  element  will  initialize  this  pointer  with  the  address  of  the  data  that  arrived  on 
the  input  arc. 


Nodes  output  data  using  flowrite  calls.  An  example  of  the  C  version  of  flowrite  is 
shown  below. 

flowrite(  ARC,  ptr,  size  ); 

This  flowrite  call  will  write  size  16  bit  words  from  the  data  structure  pointed  to  by 
ptr  to  the  output  arc  ARC. 

Firing  Rules 

This  paper  provides  only  a  brief  overview  of  the  Loral  Data  Graph  Language.  The 
Data  Graph  Language  also  includes  Tiring  rules,  which  will  not  be  described  in  this 
paper. 

Firing  rules  allow  the  program  to  alter  the  normal  conjunctive  semantics  of  the 
data  graphs  described  so  far.  A  node  that  uses  a  conjunctive  firing  rule  will  only  fire 
when  it  has  all  its  inputs.  To  implement  conditional  flow  of  data  (a  dataflow  control 
structure  similar  to  the  IF  statement)  and  loops,  more  complex  firing  rules  are  needed. 

2.  The  LDF  100  Hardware 

The  LDF  100  logically  consists  of  three  sub-systems: 

1.  The  UNIX  system 

2.  The  dataflow  engine 

3.  The  real  time  I/O  sub-system 

Although  these  components  are  logically  separate,  they  are  physically  packaged  in 
one  or  more  standard  instrument  rack  chassis. 

The  UNIX  System 

The  compilers  for  C  and  Fortran,  and  the  dataflow  graph  processing  tools  needed 
to  creat  a  dataflow  program  are  supported  under  UNIX.  The  LDF  100  uses  the  National 
Semiconductor  GENIX  port  of  UNIX  for  the  National  32000  microprocessor.  This  UNIX 
system  runs  on  a  single  board,  with  1M  byte  of  RAM,  hardware  floating  point  and 
demand  paged  virtual  memory. 

The  Dataflow  Engine 

To  the  programmer  on  the  UNIX  system  the  dataflow  engine  appears  to  be  a  dev¬ 
ice.  This  device  can  be  written  to  and  read  from,  like  any  other  device.  I/O  services  to 
support  reading  and  writing  of  tagged  data  streams  are  supported. 

The  connection  between  UNIX  and  the  dataflow  engine  is  managed  by  a  board 
referred  to  as  the  FLObus  (tm)  Interface  Processor,  or  FIP.  The  FIP  is  an  intelligent 
board  that  has  its  own  NS32000  processor,  local  RAM  and  a  mail  box  ram  that  is  used 
to  communicate  with  UNIX. 

The  Dataflow  Processor 

The  dataflow  processor  is  divided  into  two  sections: 


The  Token  Processing  Section  (TPS) 
The  Node  Processing  Section  (NPS) 


The  Token  Processing  Section  (TPS)  is  connected  to  the  dataflow  bus.  Hardware 
on  the  TPS  fetches  selected  dataflow  tokens  from  the  dataflow  bus  and  places  them  in 
an  input  FIFO.  The  NS32000  microprocessor  in  the  TPS  gathers  the  tokens  from  the 
FIFO  and  collects  them  into  data  sets  referred  to  as  packets.  A  packet  consists  of  one  or 
more  tokens  and  is  the  amount  of  data  needed  to  'Complete"  a  node's  input  arc.  When 
all  of  a  node’s  input  arcs  are  complete,  the  node  can  be  scheduled  for  execution. 

When  all  of  the  data  that  is  needed  to  execute  a  node  has  been  collected,  the  node 
is  scheduled  for  execution  on  the  Node  Processing  Section  (NPS).  Like  the  TPS,  the 
NPS  uses  a  NS32000  microprocessor.  The  NPS  also  includes  a  floating  point  unit,  an 
interrupt  control  unit  and  a  memory  management  unit.  The  NPS  and  TPS  sections  of 
the  node  processor  are  shown  in  the  block  diagram  in  figure  1. 

A  separate  data  fetch  section  (the  NPS)  and  a  separate  application  execution  sec¬ 
tion  (the  TPS)  allows  pipelining  within  the  node  processor.  While  the  NPS  is  executing 
a  node  the  TPS  can  continue  to  collect  data  for  the  remaining  nodes.  Since  data  fetch  is 
a  bottleneck  in  parallel  processors,  this  optimizes  the  performance  of  the  LDF  100. 

Building  Large  Dataflow  Systems 

All  LDF  100  systems  consist  of  a  master  chassis  that  contains  the  UNIX  develop¬ 
ment  environment  and  optional  expansion  chassis.  These  chassis  are  packaged  in  an 
attractive  unit  based  on  a  standard  instrument  rack.  The  master  chassis  can  hold  up  to 
eleven  dataflow  processors  (i.e.,  node  processors).  The  expansion  chassis  can  hold  up  to 
fifteen  processors  each.  Expansion  chassis  can  be  added  to  the  system  until  it  consists  of 
more  than  one  hundred  processors. 

The  chassis  are  connected  via  a  board  referred  to  as  the  FLObus  Network  Interface. 
This  board  supports  the  fast  transfer  of  dataflow  tokens  between  chassis.  This  transfer 
takes  place  via  combinatoric  logic,  so  a  delay  of  only  three  bus  cycles  is  introduced.  The 
FLObus  Network  Interface  (FN1)  is  programmed  to  allows  selected  tokens  to  flow 
between  chassis.  Only  those  tokens  that  cross  between  chassis  on  interchassis  graph  arcs 
are  programmed  into  the  FNI.  so  the  loading  on  the  dataflow  bus  is  minimized.  A 
multi-chassis  LDF  configuration  is  shown  in  figure  2. 
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SISAL:  Initial  MIMD  Performance  Results 
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1  Introduction 


SISAL  (Streams  and  Iteration  in  a  Single-Assignment  Language)  [MSA*85]  is  a  language  for  ex¬ 
pressing  algorithms  to  execute  on  highly  concurrent  computers.  Like  other  functional  languages, 
SISAL  displays  characteristics  allowing  compilers  to  detect  and  easily  exploit  the  underlying  archi¬ 
tectural  parallelism  and  the  potential  for  parallelism  in  the  program. 

Four  groups  associated  with  different  organizations  and  with  different  original  target  architec¬ 
tures  cooperated  to  define  SISAL.  The  organizations  (and  architectures)  are:  Lawrence  Livermore 
National  Laboratory  (Cray  vector  processors),  Digital  Equipment  Corporation  (VAX1  processor 
clusters),  the  University  of  Manchester  (Manchester  data  flow  machine),  and  Colorado  State  Uni¬ 
versity  (Denelcor  HEP  multiprocessor).  SISAL  descends  from  VAL  [AD79,McG82]  and  retains  its 
functional,  single  assignment  character.  SISAL  differs  from  VAL  in  possessing  simpler  error  types, 
general  recursion,  a  stream  data  type,  and  improved  iteration  forms.  The  SISAL  project  has  also 
benefited  from  earlier  work  described  in  [GP79]  and  (KLP79). 

SISAL  offers  several  advantages  for  use  on  a  multiprocessor  system.  The  user  need  not  (and 
cannot)  manage  or  express  parallelism  in  a  program.  The  compiler  detects  parallelism,  decides  how 
much  to  exploit,  and  generates  code  to  take  advantage  of  the  parallelism  and  manage  it.  Users 
express  algorithms  without  expressing  or  implementing  parallelism.  Language  syntax  is  Pascal-like. 
A  familiar  syntax  decreases  the  time  it  takes  to  learn  the  language  and  aids  in  program  readability. 
For  a  SISAL  language  overview  see  [A085]. 

The  SISAL  groups  have  formed  a  pool  of  benchmark  programs  typical  of  those  that  programmers 
would  actually  write  using  SISAL.  Many  are  short,  but  there  are  some  over  1,500  lines  long.  Thus 
they  often  represent  actual  codes  run  in  the  “real  world.” 

At  CSU  we  implemented  SISAL  on  the  Denelcor  HEP  and  evaluated  the  performance  of  the 
initial  implementation.  Work  proceeds  to  port  the  software  to  other  MIMD  systems.  The  sections 
below  briefly  discuss  the  compiler  implementation  with  emphasis  on  portability  considerations, 
describe  the  run  time  support  system,  and  survey  performance  results.  See  [AO86]  for  a  more 
complete  description  of  the  SISAL  implementation  at  CSU. 

'VAX  is  a  trademark  of  Digital  Equipment  Corporation 


Input  Translator  Output 

(1)  SISAL  program  — »  SISAL  parser  — »  IF1 

(2)  IF1  — »  IF1  optimizer  — *  IF1  (optional) 

(3)  IF1  —  IFPCC  generator  —  IFPCC 

(4)  IFPCC  — ►  C  code  generator  — ►  Native  code 

Figure  1:  SISAL  Programming  System  at  CSU 

2  Compiler 

The  SISAL  compiler,  designed  for  flexibility  and  portability,  consists  of  several  phases.  We  use  a 
parser  that  produces  a  machine-independent  intermediate  form,  “IFl”  [SG85],  along  with  optimiz¬ 
ers  that  improve  this  intermediate  form  [SG85j;  both  were  developed  at  LLNL.  The  next  phase 
[CA084a]  translates  to  another  intermediate  form  acceptable  to  the  second  pass  (code  generation 
and  optimization)  of  the  portable  C  compiler  [Joh78],  used  almost  universally  in  UNIX  1  systems. 
This  approach  has  saved  a  great  deal  of  effort  and  promises  to  ease  the  porting  of  the  translator 
system  to  other  parallel  processors  that  support  C,  or  facilitate  its  use  as  a  cross-compiler.  A 
diagram  of  the  CSU  SISAL  compiler  appears  in  Figure  I. 

3  Run  Time  System 

Most  SISAL  run  time  software  is  written  in  C,  with  a  few  assembly  language  routines.  This  aids 
in  system  portability  to  other  machines  running  UNIX.  The  major  responsibilities  of  the  run  time 
routines  are  managing  processes,  arrays,  streams,  and  dynamic  storage. 

Parallel  execution  units  in  MIMD  SISAL  are  function  bodies,  parallel  loop  “stripes,”  and  multi¬ 
expression  components.  Execution  of  parallel  loops  and  multi-expression  components  is  performed 
by  daemon  processes  that  interact  closely  with  SISAL  process  management  in  low-overhead  ways. 
(For  these  tests,  this  had  to  be  emulated  by  packaging  loop  slices  as  explicit  functions;  the  automatic 
capability  is  now  fully  implemented.)  So  mapping  SISAL  processes  onto  the  hardware  resources 
is  the  major  concern.  If  each  function  reference  occupied  a  processor,  then  either  the  number  of 
available  processors  would  be  overrun  or  deadlock  would  result  as  parent  processes  held  processors 
waiting  for  completion  of  child  functions  who  cannot  execute.  We  prevent  deadlock  by  requiring  that 
a  parent  process  waiting  for  value(s)  from  children  relinquish  its  processor  and  suspend  execution. 
See  [BA084]  and  [VA085]  for  more  complete  discussions  of  process  management. 

Multi-dimensional  arrays  in  SISAL  are  “arrays  of  arrays”  and  are  dynamic.  At  compile  time 
we  know  only  the  dimensionality  and  element  type;  bounds  are  in  general  determined  only  at  run 
time.  Necessary  resulting  run  time  tasks  include  adding  an  element  to  an  array,  fetching  a  value 
from  an  array,  etc.  [CA084b]  has  a  complete  discussion  on  array  implementation. 

3 UNIX  is  a  trademark  of  AT&T  Bell  Laboratories. 


Streams  are  an  important  data  structure  because  they  provide  pipelined  parallelism  among 
stream  producer  and  consumer  functions.  Both  the  stream  producer  and  consumers  execute  si¬ 
multaneously  so  the  program  processes  can  form  generalized  parallel  pipelines.  Only  a  contiguous 
substream  of  the  stream  needs  to  be  extant  at  any  time  because  values  that  the  slowest  stream 
consumer  no  longer  will  reference  need  not  remain.  In  this  way  programs  can  execute  that  produce 
and  consume  (finite  prefixes  of)  infinite  streams;  these  programs  terminate  normally.  For  a  further 
discussion  see  [A083]. 

SISAL  run  time  support  relies  on  dynamic  storage  allocation  for  process  descriptors,  streams, 
and  arrays.  An  adapted  boundary  tag  method  is  in  place  that  allows  multiple  concurrent  allocations 
and  deallocations  [BA085].  Also,  we  include  a  “front  end”  caching  scheme  that  can  greatly  speed  up 
dynamic  storage  management  under  the  conditions  that  obtain  during  SISAL  execution  [OA85a]. 
For  a  complete  discussion  of  dynamic  storage  management  in  SISAL,  as  well  as  other  run  time 
support  details,  see  [OA85b]. 

Note  that  in  these  tests  array  storage  is  not  properly  recycled,  adversely  affecting  performance. 
We  anticipate  that  a  new  intermediate  form  under  development  at  LLNL  will  help  all  SISAL 
implementations  to  optimize  storage  management. 

4  Performance 

We  briefly  describe  the  performance  of  a  few  benchmark  programs  as  measured  on  a  single-PEM 
Denelcor  HEP  processor;  unfortunately  this  machine  is  no  longer  available.  Because  of  the  im¬ 
plementation  restrictions  mentioned  in  the  previous  section,  we  believe  these  results  represent  a 
baseline  for  enhanced  performance  on  other  shared-memory  systems.  Overall,  the  speedup  curves 
display  performance  similar  to  those  of  HEP  programs  written  in  imperative  languages  with  ex¬ 
plicit,  low-level  process  synchronization  mechanisms.  Further,  these  curves  follow  the  general  form 
predicted  for  this  architecture  in  [Jor85], 

The  performance  of  a  simple  numeric  integration  program  is  seen  in  Figure  2.  The  speedup 
curve  is  typical  of  programs  that  operate  on  scalars.  Process  management  among  SISAL  functions 
that  exchange  simple  values  is  being  tested  here. 

Figure  3  shows  the  speed  up  curve  for  a  matrix  multiplication  program.  The  sawtooth  effect 
seen  in  the  speedup  curve  comes  from  varying  the  number  of  hardware  resources  against  a  fixed 
size  for  the  data.  The  periodic  degradations  are  due  to  the  lack  of  array  space  recycling. 

A  noise  filtering  program  was  coded  in  SISAL;  results  are  displayed  in  Figure  4.  The  sawtooth 
effect  of  the  previous  benchmark  is  not  as  pronounced  because  storage  interference  is  not  a  factor. 

Finally,  the  performance  of  a  sieve  method  for  prime  number  generation  is  given  in  Figure 
5.  This  graph  shows  the  best  speedup  among  the  four  for  small  numbers  of  hardware  processes. 
Beyond  13  hardware  processes  there  is  not  enough  work  to  keep  SISAL  functions  running  and  offset 
the  cost  of  frequently  idling  hardware  processes;  performance  actually  degrades  The  smoothness 
of  the  curve  shows  that  stream  run  time  support  operates  in  a  predictable  fashion. 
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Figure  4:  Noise  Removal  Performance  Results 
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Figure  5:  Sieve  of  Eratosthneses  Performance  Results 
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Abstract 

One  of  the  significant  problems  which  must  be  addressed  if  we  are  to  realize  the  comput¬ 
ing  potential  offered  by  parallel  architectures  has  to  do  with  developing  a  better  understanding 
of  the  relationship  between  parallel  algorithms  and  parallel  architectures.  In  this  paper, 
research  on  the  mapping  of  algorithms  to  reconfigurable  parallel  architectures  is  presented. 
The  thrust  of  this  work  is  in  identifying  those  characteristics  of  parallel  algorithms  which  have 
the  greatest  effect  on  their  execution,  and  in  identifying  a  correspondence  between  those 
characteristics  and  the  characteristics  of  parallel  architectures.  The  context  of  this  work  is  in 
the  design  of  an  Intelligent  Operating  System  for  the  PASM  reconfigurable  multimicroprocessor 
system.  The  task  of  the  Intelligent  Operating  System  will  be  to  direct  the  selection  and 
scheduling  of  algorithms  and  the  configuring  of  the  architecture  for  the  execution  of  an  image 
understanding  system. 


1.  Introduction 

In  both  parallel  architectures  and  parallel  algorithms,  there  exist  many  design  choices  for 
which  there  are  no  direct  counterparts  in  conventional  serial  processing.  In  architectures, 
examples  of  such  choices  include  number  of  processors,  local  versus  global  memory  organiza¬ 
tion,  mode  of  processing  (synchronous  or  asynchronous),  and  interconnection  topology  In 
parallel  algorithms,  issues  which  do  not  arise  in  serial  programming  include  determination  of 
the  number  of  processors  needed/useful  for  a  task,  data  allocation  across  memories,  synchroni¬ 
zation  necessary/beneficial,  and  interprocessor  communication  requirements.  The  move  to 
parallelism  has  introduced  new  degrees  of  freedom  to  both  the  architecture  and  algorithm 
design  process.  For  effective  use  of  parallel  systems,  it  is  essential  to  obtain  a  good  match 
between  algorithm  requirements  and  architecture  capabilities. 

The  question  of  mapping  parallel  algorithms  to  parallel  architectures  has  importance  at 
three  levels.  First,  it  bears  directly  on  the  algorithm  design  process.  General  knowledge  about 
what  constitutes  an  effective  match  between  a  parallel  algorithm  and  a  parallel  architecture 
can  accelerate  the  process  of  developing  new  parallel  algorithms  for  a  given  machine.  Second, 
an  understanding  of  the  relation  between  algorithms  and  architectures  is  a  prerequisite  for  the 
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computing  Research  Center  under  Contract  MDA904-85-5027. 


fast,  efficient  design  of  algorithmically-specialized  systems  [SnJ85].  Given  a  fixed  set  of  algo¬ 
rithms,  architectures  tailored  for  the  execution  of  those  algorithms  can  be  developed  if  the 
architectural  requirements  of  the  algorithms  are  understood.  Third,  a  general  method  of  relat¬ 
ing  algorithms  and  architectures  will  allow  efficient  use  of  flexible  parallel  systems  such  as 
PASM  [SiS84].  Many  problems  in  image  understanding  require  execution  of  a  large  number  of 
algorithms,  with  the  exact  choice  and  execution  sequence  of  the  algorithms  varying  depending 
on  the  input  data.  Critical  timing  requirements  may  accompany  such  scenarios. 
Reconfigurable  parallel  systems  can  provide  an  environment  which  can  be  used  to  test  the 
ability  of  various  architectures  --  either  sequences  of  simulated  special-purpose  parallel  systems 
or  successive  configurations  of  the  reconfigurable  system  itself  --  to  meet  these  computing 
needs.  Integral  to  the  effective  use  of  these  flexible  parallel  systems,  however,  will  be  the  abil¬ 
ity  to  select  machine  configurations  based  on  knowledge  about  the  algorithms  to  be  executed. 
In  order  to  accomplish  this  automatically,  the  operating  system  will  need  to  use  information 
about  the  characteristics  of  the  algorithms  to  select  successive  configurations  of  the  parallel 
architecture. 

In  [DeS85],  we  have  presented  a  model  for  an  Intelligent  Operating  System  for  the  execu¬ 
tion  of  image  understanding  tasks  on  PASM  (see  Fig.  1).  In  this  model,  the  Algorithm  Data¬ 
base  contains  the  algorithms  which  can  be  used  in  performing  the  image  understanding  task, 
plus  two  types  of  information:  properties  of  the  algorithm  which  describe  its  image  analysis 
capabilities  (e.g.,  “edge  finding  in  noisy  images”)  and  properties  which  describe  the  execution 
characteristics  of  the  algorithm  on  different  architecture  configurations.  The  Algorithm  Data¬ 
base  may  contain  multiple  algorithms  to  perform  the  same  basic  task  (e.g.,  edge  finding),  with 
different  versions  providing  different  image  processing  capabilities  (e.g.,  good  performance  in 
the  presence  of  noise)  and/or  having  different  parallel  implementations  (e.g.,  based  on  raster 
versus  square  subimage  allocation  of  the  image  to  the  parallel  memories).  The  Intelligent 
Image  Understanding  System  determines  what  types  of  symbolic  and  iconic  operations  should 
be  performed,  selects  the  algorithms  to  be  executed,  and  uses  the  results  from  these  operations 
to  determine  what  needs  to  be  done  next.  The  Reconfigurable  Parallel  Processing  System  in 
our  study  is  PASM,  which  can  be  dynamically  reconfigured  under  software  control  to  operate 
as  one  or  more  independent  partitions  (“virtual  machines”)  of  various  sizes.  Each  partition 
can  operate  in  either  SIMD  or  MIMD  mode,  and  can  dynamically  switch  modes  under  software 
control.  The  Low-Level  Operating  System  Routines  accomplish  the  actual  reconfiguration. 
The  Intelligent  Operating  System  uses  information  from  the  Intelligent  Image  Understanding 
System,  Algorithm  Database,  and  knowledge  about  the  current  system  state  to  select  a  specific 
algorithm  implementation  and  machine  configuration  for  accomplishing  the  next  step  of  the 
image  understanding  task.  The  Intelligent  Image  Understanding  System  and  the  Intelligent 
Operating  System  are  to  be  implemented  as  expert  systems. 

Various  researchers  have  examined  the  problem  of  characterizing  parallel  architectures 
(e.g.,  see  (Kun80,  EtN83]),  characterizing  parallel  algorithms  (e.g.,  see  [DaL81,  H0J8I,  SmS85), 
and  relating  algorithms  to  architectures  (e.g.,  see  [CaL82,  ChF83]).  In  this  paper,  we  consider 
the  problem  of  identifying  the  characteristics  of  parallel  algorithms  which  have  the  greatest 
effect  on  their  execution,  and  of  identifying  a  correspondence  between  those  characteristics  and 
the  characteristics  of  parallel  architectures.  This  information  is  to  be  used  in  constructing  the 
Algorithm  Database  in  the  above  model. 


Fig.  1.  System  model. 

2.  Architecture  Model 

We  assume  a  target  parallel  architecture  with  the  general  attributes  listed  below.  The 
purpose  of  the  assumptions  is  to  define  a  very  general  architecture  framework,  so  that  no 
undesirable  restrictions  are  imposed  on  the  architecture  by  the  assumptions.  Rather,  by  exam¬ 
ination  of  the  algorithm  attributes,  the  subset  of  capabilities  required  in  the  architecture  can 
be  selected.  Thus  the  framework  is  more  general  than  most  existing  parallel  systems,  but 
many  architectures,  including  PASM  [SiS84j,  Ultracomputer  [Got82],  RP3  [Pfi85],  the  Cosmic 
Cube  [Sei85],  the  Butterfly  [Cro85],  the  Connection  Machine  [Hil85],  MPP  [Bat80],  Clip  4 
[Duf82],  and  pyramid  architectures  [TaK80,  Uhr83],  can  be  characterized  in  terms  of  subsets  of 
these  design  attributes. 

Assumptions  for  the  target  architecture  are  as  follows: 

•  The  system  consists  of  a  large  number  of  homogeneous  processors.  In  mapping  the  algo¬ 
rithm  to  an  ideal  parallel  architecture,  the  number  of  processors  available  is  considered 
to  be  unlimited. 

•  The  system  can  be  organized  with  processors  accessing  a  shared  global  memory  or  with 
each  processor  having  an  associated  local  memory,  or  a  hybrid  of  the  two  approaches. 

•  The  system  is  partitionable  into  independent  submachines  of  various  sizes.  This  implies 
that  (1)  a  number  of  algorithms  can  be  executing  simultaneously,  on  different  partitions 
of  the  system,  and  (2)  for  each  algorithm,  the  partition  size  can  be  selected  to  meet  the 
needs  of  the  algorithm.  The  partitioning  can  be  changed  dynamically  at  execution  time. 

•  Each  partition  of  the  system  (and  the  entire  system  itself)  is  capable  of  both  SIMD  and 
MIMD  operation,  and  can  dynamically  switch  between  modes  during  execution. 

•  The  system  has  a  flexible  interconnection  network  which  can  provide  a  wide  variety  of 
communications  patterns  within  each  partition. 

Given  the  above  assumptions,  the  problem  becomes  one  of  selecting  the  architecture 
configuration  —  memory  organization,  partition  size,  mode  of  operation,  network  configuration 
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—  which  best  matches  the  attributes  of  the  algorithm. 


3.  Algorithm  Characteristics 

Candidate  algorithm  characteristics  are  listed  below.  Table  1  summarizes  the  relationship 

between  algorithm  and  architecture  characteristics.  A  “1”  entry  in  the  table  indicates  a  prob¬ 
able  primary  dependence  between  the  algorithm  and  architecture  characteristics,  an  entry  of 

“2”  indicates  a  likely  secondary  dependence  and  “3”  denotes  a  less  strong  dependence. 

1.  Nature  of  the  parallelism.  Under  this  category  come  a  number  of  attributes  having  to  do 
with  the  “kind”  of  parallelism  which  is  used  and  the  the  way  in  which  the  algorithm 
and/or  data  can  be  decomposed. 

a.  Data  parallelism  versus  functional  parallelism.  Parallelism  can  be  achieved  by  dividing 
the  data  among  the  processors,  by  decomposing  the  algorithm  into  segments  which 
can  be  assigned  to  different  processors,  or  by  macropipelining  (which  is  a  special  case 
of  decomposition  of  the  algorithm  into  functional  segments).  The  type  of  parallelism 
will  affect  the  allocation  of  data,  the  assignment  of  processes  to  processors,  and  the 
basic  decision  as  to  what  mode  of  parallelism  (SIMD/MIMD/macropipeline)  to  use. 

b.  Data  granularity.  Data  granularity  deals  with  the  “size"  of  the  data  items  processed 
as  a  fundamental  unit,  and  will  have  a  bearing  on  the  data  allocation,  communica¬ 
tions  requirements,  processor  capability,  and  memory  requirements. 

c.  Module  granularity.  Module  granularity  [Kun80)  quantifies  the  amount  of  processing 
which  can  be  done  independently,  either  of  other  processes  or  of  operations  being  per¬ 
formed  in  other  processors.  It  is  essentially  a  measure  of  the  frequency  of  synchroni¬ 
zation,  and  will  affect  the  choice  of  SIMD  versus  MIMD  operation,  the  assignment  of 
processes  to  processors,  the  communications  requirements,  and  the  likelihood  of  equal¬ 
izing  the  execution  times  of  component  parts  of  the  algorithm. 

2.  Degree  of  parallelism.  This  will  be  related  to  both  the  data  granularity  and  the  module 
granularity.  Its  most  direct  impact  will  be  on  the  choice  of  machine  size  and  on  the  max¬ 
imum  speedup  attainable. 

3.  Uniformity  of  the  operations.  If  the  operations  to  be  performed  are  uniform  (e.g.,  across 
the  data  or  feature  set),  then  SIMD  (or  pipeline)  processing  may  be  feasible.  If  the  opera¬ 
tions  are  not  uniform,  then  MIMD  processing  will  be  chosen  and  strategies  to  equalize  the 
computational  load  across  the  processors  may  come  into  play.  These  strategies  may  be 
applied  statically  at  compile  time  or  dynamically  at  execution  time. 

4.  Synchronization  requirements.  In  addition  to  the  synchronization  requirements  implied  by 
the  process  granularity,  consideration  of  precedence  constraints  is  implicit  in  characteriz¬ 
ing  the  synchronization  requirements.  This  will  affect  the  assignment  of  processes  to  pro¬ 
cessors  and  the  scheduling  of  various  components  of  the  algorithm. 

5.  Static j dynamic  character  of  the  algorithm.  The  pattern  of  process  generation  and  termina¬ 
tion  will  affect  the  processor  utilization,  the  scheduling  of  sub- processes,  the  memory 
organization,  and  the  communications  requirements. 

6.  Data  dependencies.  The  data  dependencies  in  an  algorithm  will  play  the  largest  role  in 
dictating  data  allocation  patterns  and  communications  characteristics.  They  will  also  have 
a  major  part  in  the  decision  to  use  a  global  versus  local  memory  organization. 

7.  Fundamental  operations.  The  basic  operations  performed  in  the  algorithm  will  dictate  the 
processor  capabilities  needed. 
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Algorithm  Characteristics 


Addressing  Modes 


8.  Data  types  and  precision.  The  atomic  data  types  and  data  precision  will  bear  most 
directly  on  the  individual  processor  capability  and  on  the  memory  requirements,  but  may 
also  imply  requirements  for  communications  bandwidth. 

9.  Data  structures.  Many  algorithms  can  be  characterized  as  having  a  “natural”  data  struc¬ 
ture  (or  structures)  on  which  operations  are  performed.  The  ability  of  an  architecture  to 
support  the  needed  access  patterns,  to  exploit  possible  regularity  in  the  structures,  and  to 
allow  the  needed  interactions  between  parts  of  the  structures  will  affect  algorithm  perfor¬ 
mance. 

4.  A  Model  for  Verifying  the  Algorithm  Characteristics  Set 

Using  the  model  from  [DeS85]  as  a  basis  (Fig.  1),  we  present  a  model  for  verifying  the 
usefulness  of  the  algorithm  characteristics  set  for  mapping  algorithms  to  architectures.  The 
system  will  include  five  components,  as  shown  in  Fig.  2:  an  Algorithm  Characteristics  Data¬ 
base,  an  Architecture  Configurations  Database,  a  Reconfigurable  Parallel  Processing  System 
(i.e.,  the  actual  architecture),  a  set  of  Low-level  Operating  System  Routines,  and  the 
Configuration  Selector  itself. 

•  The  Algorithm  Characteristics  Database  will  contain  a  compendium  of  information  about 
the  characteristics  of  parallel  algorithms  and  their  relations  to  the  various  aspects  of  paral¬ 
lel  architectures. 

•  The  Architecture  Configurations  Database  will  contain  the  parameters  which  describe  the 
underlying  Reconfigurable  Parallel  Processing  System.  This  will  include  information  about 
the  number  of  processors  in  the  system,  possible  memory  organizations  which  can  be 
selected,  modes  of  operation  which  are  available,  and  a  description  of  the  interconnection 
network.  In  the  most  general  case,  this  will  correspond  to  the  architecture  assumptions 
outlined  in  Section  2.  When  running  the  operating  system  on  a  particular  architecture,  it 
will  contain  detailed  information  afcout  that  architecture. 

•  Our  Reconfigurable  Parallel  Processing  System  will  be  PASM.  PASM  satisfies  the  architec¬ 
ture  assumptions  outlined  in  Section  2  above,  and  is  therefore  a  suitable  target  architecture. 
The  availability  of  the  PASM  simulator  and  prototype  will  allow  us  to  test  our 
architecture-dependent  algorithms  on  the  machine  configurations  selected  by  the  operating 
system. 

•  The  Low-level  Operating  System  Routines  will  be  used  to  do  the  actual  system 
configuration,  and  are  currently  under  development  for  PASM. 

•  The  Configuration  Selector  will  select  the  machine  configuration  given  the  characteristics  of 
the  parallel  algorithm  and  the  parameters  of  the  Reconfigurable  Parallel  Processing  System. 
The  function  of  the  Configuration  Selector  will  be  to  relate  the  characteristics  of  a  parallel 
algorithm  to  possible  architecture  configurations,  and  to  combine  the  demands  of  the  vari¬ 
ous  characteristics  into  a  single  decision  about  the  machine  configuration.  Furthermore,  the 
Configuration  Selector  will  have  to  determine  the  assignment  of  resources  to  multiple  algo¬ 
rithms  which  can  be  executed  simultaneously.  T'  o  Configuration  Selector  will  employ  the 
information  in  the  Algorithm  Characteristics  Database  and  the  Architecture  Configurations 
Database.  Possible  tools  for  implementation  of  this  component  include  expert  systems  and 
rule  sets.  (An  obvious  question  arises  as  to  how  the  characteristics  of  a  new  algorithm  are 
obtained.  A  long  term  facet  of  this  research  will  involve  studying  to  what  extent  the 
characteristics  can  be  extracted  automatically  from  a  high-level-language  representation  of 
the  algorithm  (e.g,  by  a  compiler).  For  the  purposes  of  demonstrating  the  algorithm-to- 
architecture  mapping,  the  more  immediate  goal  is  the  development  of  a  user  interface 


Fig.  2.  Organization  of  the  Operating  System  Component 
for  Mapping  Algorithms  to  Architectures. 

which  can  help  the  user  frame  the  algorithm  in  terms  of  the  characteristics  needed  to  allow 
the  operating  system  to  select  an  architecture  configuration.) 

The  combination  of  the  Algorithm  Characteristics  Database,  the  Architecture 
Configurations  Database,  and  the  Configuration  Selector  will  perform  the  automatic  selection 
of  a  machine  configuration  suitable  for  execution  of  the  given  algorithm.  The  Reconfigurable 
Parallel  Processing  System,  the  Low-level  Operating  System,  and  the  Architecture 
Configurations  Database  are  machine-dependent  components  of  the  system.  The  Algorithm 
Characteristics  Database  and  the  Configuration  Selector  components,  however,  are  machine- 
independent.  Thus,  although  we  are  using  a  specific  system  as  a  vehicle  for  demonstrating  our 
work,  the  knowledge  about  the  mapping  of  parallel  algorithms  to  parallel  architectures  will  be 
general  in  nature,  and  will  be  applicable  to  a  broad  class  of  systems. 
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A  Distributed  System  Architecture  Based  on  Macro  Dataflow  Model 

by 

Jane  W.  S.  Liu  and  Andrew  Grimshaw 

I.  Introduction 

Increasingly  in  recent  years,  distributed  systems  containing  interconnected  workstations  and 
host  computers  are  used  to  provide  a  wide  range  of  services  in  word  processing  and  text  formating, 
computing,  and  information  management  and  distribution.  A  user  on  such  a  system  typically 
executes  his  jobs  on  his  own  workstations  or  local  host.  As  a  result,  many  workstations  are 
frequently  idle  and  host  computers  are  under  utilized.  Redundant  resources  naturally  available  in 
such  a  system  are  often  not  effectively  used  to  improve  the  reliability  and  availability  of  services 
provided  by  the  system.  (Since  it  is  not  necessary  to  distinguish  between  them,  we  refer  to  both 
workstations  and  host  computers  as  hosts.) 

There  are  two  approaches  to  capture  and  use  the  capability  of  idle  or  under-utilized  hosts  in 
a  distributed  system.  In  one  approach,  heavily  loaded  hosts  send  new  jobs  and  migrate  executing 
jobs  to  lightly  loaded  hosts  for  execution,  i.e.,  hosts  do  load  balancing.  Many  load  balancing 
schemes  have  been  proposed  to  improve  response  time  and  resource  utilization  [1-9].  LOCUS, 
Accent,  V-System,  and  DEMOS/MP  are  examples  of  recent  distributed  systems  that  support  some 
forms  of  of  load  balancing  [10-13].  When  inter-host  communication  cost  is  sufficiently  low, 
response  time  and  resource  utilization  of  a  distributed  system  can  be  improved  further  by 
partitioning  jobs  into  tasks  and  assigning  tasks  to  different  hosts  to  be  executed  concurrently 
whenever  possible.  This  approach  is  the  same  as  the  multitasking  approach  taken  to  increase 
parallelism  in  tightly-coupled  multiprocessor  systems.  It  is  made  feasible  recently  by  the  advent 
of  extremely  high-speed,  local-area  networks.  Several  optimal  and  suboptimal  task  assignment 
algorithms  have  been  designed  to  find  assignments  of  tasks  to  hosts  in  distributed  systems  [14,15]. 
These  algorithms  take  into  account  inter-task  communication  times,  which  are  often  large 
compared  with  task  execution  times  when  tasks  belonging  to  the  same  job  are  assigned  to  different 
hosts. 

Load  balancing  and  distributed  task  assignment  schemes  essentially  extend  the  scheduling 
and  resource  management  methods  used  in  single  host  systems  to  support  job  sequencing  in 
distributed  environments.  Typically,  these  schemes  require  the  programmer  to  break  up  jobs  into 
tasks  and  explicitly  handle  inter-task  communication.  Complexities  of  load  balancing  and  task 
assignment  algorithms  grow  with  the  number  of  tasks.  These  factors  make  it  difficult  to  achieve  a 
high  degree  of  parallelism.  When  the  applications  supported  by  the  distributed  system  require 
high  reliability  and  availability,  the  need  for  fault  tolerance  and  robustness  makes  load  balancing 
and  task  assignment  even  more  complex. 

The  problem  of  providing  efficient,  reliable  services  in  distributed  environments  on  highly 
available  basis  is  made  difficult  partially  by  the  fact  that  distributed  systems  are  typically  “Von 
Neumann  machines.”  In  such  a  system,  executions  of  user  processes  are  carried  out  under  the 
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supervision  of  a  system-wide  controller  that  sequences  user  processes.  To  ensure  that  the 
controller  is  reliable  and  highly  available,  it  is  often  necessary  to  distribute  and/or  replicate  its 
functional  modules  on  more  than  one  host.  The  need  to  coordinate  the  operations  of  a  distributed 
controller  on  different  hosts  and  to  reconfigure  its  structure  in  the  presence  of  load  fluctuations  and 
failures  is  one  of  the  causes  of  inefficiency.  The  large  amount  of  state  information  that  must  be 
gathered  and  maintained  by  the  controller  to  support  its  decisions  is  another  cause. 

One  approach  used  to  increase  parallelism  in  a  natural  and  robust  manner  in  tightly-coupled 
multiprocessor  environments  is  to  use  the  dataflow  architecture  [16, 17] .  This  paper  proposes  the 
use  of  this  approach  to  increase  parallelism,  and  to  improve  fault  tolerance  and  availability  in 
loosely-coupled  distributed  environments.  A  macro  dataflow  model  is  developed  to  model 
distributed  computation.  The  macro  dataflow  model  is  based  on  the  dataflow  model,  with  features 
added  to  model  interprocess  communication  and  synchronization.  A  high-level,  macro-dataflow, 
distributed  system  architecture  based  on  this  model  is  proposed.  As  in  a  dataflow  machine, 
granules  of  computation  in  a  macro-dataflow  system  are  performed  as  soon  as  all  necessary  input 
are  available.  A  high  degree  of  parallelism  can  be  achieved  naturally  without  the  need  of  a 
complex,  distributed  controller.  Many  fault  tolerance  techniques  can  be  used  in  an  integrated 
manner  to  ensure  fault  tolerance  and  to  improve  system  availability.  An  object-oriented  approach 
to  software  design  and  implementation  can  be  used  to  support  the  development  of  macro  dataflow 
programs. 

The  rest  of  the  paper  is  organized  as  follows.  Section  II  describes  briefly  the  well-known 
dataflow  model  and  motivates  the  macro  dataflow  model.  The  elements  of  macro  dataflow  model 
are  presented  in  Section  III.  Section  IV  proposes  an  architecture  of  distributed,  macro-dataflow 
systems  for  executing  macro-dataflow  programs.  The  advantages  of  macro-dataflow  distributed 
system  are  also  discussed.  Section  V  is  a  summary. 

II.  Background  and  Motivation 

In  the  well-known  dataflow  model,  there  are  two  types  of  objects:  tokens  and  actors.  Tokens 
carry  data  or  control  information.  Each  actor  performs  a  function  based  on  the  information 
contained  in  the  tokens  it  consumes.  Actors  are  computation  primitives,  corresponding  to  granules 
of  computation.  In  most  cases,  granules  of  computation  are  machine  instructions  [16,17].  Actors 
do  not  have  any  internal  state  preserved  from  one  computation  to  the  next. 

A  computation  can  be  described  by  a  dataflow  graph  in  which  nodes  are  actors.  Actors  are 
connected  by  directed  arcs  along  which  tokens  flow.  Arcs  model  data  dependencies  between 
granules  of  computation.  Specifically,  the  execution  of  any  granule  cannot  begin  until  all  granules 
of  computation  on  which  it  depends  have  completed.  Correspondingly,  in  the  model,  an  actor  is 
enabled  and  may  "fire"  (execute)  only  when  there  are  tokens  on  all  of  its  incoming  arcs. 
Parallelism  is  gained  in  dataflow  architectures  by  allowing  any  actor  to  execute  on  any  processor, 
and  by  allowing  as  many  enabled  actors  to  fire  as  there  are  processors  to  execute  them.  When 
there  is  a  sufficiently  large  number  of  processors,  only  actors  that  depend  on  uncompleted  granules 
of  computation  are  not  enabled. 

One  disadvantage  of  the  dataflow  approach  is  that  the  amount  of  inter  processor 
communication  can  be  quite  high,  since  each  computation  can  require  many  tokens  to  be  sent.  In 
most  distributed  systems  where  inter-host  communication  is  costly,  the  advantage  in  the  high 
degree  of  parallelism  gained  by  using  this  approach  is  lost  unless  lower  resolution  is  used  17] .  In 
other  words,  instead  of  individual  machine  instructions,  more  complex,  high-level  functions  should 
be  chosen  as  computation  primitives.  To  illustrate  this  point,  let  us  consider  the  dataflow  graph 
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shown  in  Figure  1.  The  program  represented  by  this  graph  generates  the  sum  l2  +  22  +  ....  +  n2. 
The  dataflow  graph  consists  of  three  subgraphs.  The  first  subgraph,  consisting  of  actors  D,  E,  and 
F,  is  a  counter.  This  subgraph  generates  a  token  for  each  integer  from  1  to  n  and  sends  the  token 
to  the  second  subgraph,  consisting  of  the  actor  C.  For  each  integer  les9  than  n,  the  first  subgraph 
also  generates  a  control  token  with  the  value  of  true.  A  false  token  is  generated  when  the  value  n 
has  been  reached.  The  control  tokens  are  sent  to  the  third  subgraph,  consisting  of  actors  A  and  B. 
The  second  subgraph  receives  inputs  tokens  containing  integers  from  1  to  n  and  computes  the 
squares  of  these  integers.  It  passes  the  results  on  to  the  third  subgraph  that  computes  the  sum  of 
the  squares.  The  sequential  imnlementation  of  the  algorithm  implemented  by  this  dataflow 
program  is  shown  below. 


LOAD 

X,0 

Load  0  in  register  X 

LOAD 

Z,0 

Load  0  in  register  Z 

LOOP: 

INC 

X 

Increment  X 

MOV 

YfX 

Move  contents  of  X  to  Y 

MULPY 

Y,Y 

Multiple  Y  to  Y 

ADD 

Z,Y 

Add  the  content  of  Y  to  content  of  Z 

CMP 

X,n 

Compare  the  content  of  X  with  n 

DONE: 

JNE 

LOOP 

Jump  to  LOOP  if  X  is  less  than  n 

Suppose  that  each  instruction  takes  one  time  unit.  When  communication  time  is  negligible,  the 
sum  of  n  squares  is  obtained  in  2n+3  time  units  for  the  dataflow  implementation,  and  6n+2  time 
units  for  the  sequential  implementation.  lOn  tokens  are  passed  between  actors  for  the  dataflow 
implementation  while  only  two  messages  are  required,  one  to  send  the  input  n  and  one  to  output 
the  result,  for  the  sequential  implementation.  Let  t  be  the  amount  of  time  required  to  send  a 
message  or  a  token.  Let  Cd  be  and  C3  be  the  completion  time  for  the  generation  of  the  sum  using 
the  dataflow  implementation  and  the  sequential  implementation,  respectively.  Assuming  that  any 
enabled  actor  fires  immediately,  and  that  all  communication  can  be  performed  in  parallel,  we  have 
that  Cd  is  equal  to  n{2t  +2)+2t  +3  and  C,  is  equal  to  6n+2t.  For  rz  —10,  Cd  becomes  larger  than 
Ct  when  t  becomes  larger  than  2  time  units.  In  distributed  systems,  t  being  larger  than  2  is  almost 
always  true. 


Figure  1.  An  Example  of  Dataflow  Implemen  .,.ion 


We  refer  to  the  average  number  of  messages  or  tokens  sent  per  computation  primitive 
executed  in  a  computation  as  the  communication  ratio  of  the  computation.  The  larger  the 
communication  ratio  of  a  computation  is,  the  faster  its  completion  time  grows  with  the  time 
required  to  send  a  message  or  token.  Hence,  to  adapt  the  dataflow  approach  to  a  loosely  -coupled 
system,  where  the  time  required  to  send  a  message  or  token  is  typically  in  the  order  of  tens  or 
hundreds  of  machines  instruction  time,  it  is  necessary  to  reduce  the  communication  ratio  of 
computations.  This  can  be  done  by  increasing  the  power  and  complexity  of  the  computation 
primitives,  i.e.,  by  increasing  the  size  of  the  granules  of  computation. 

HI.  Macro  Data  Flow  Model 

In  the  macro  dataflow  model  of  distributed  computations,  macro  actors  perform  complex, 
high-level  functions  instead  of  individual  machine  instructions.  For  example,  in  a  distributed 
relational  database  system,  one  may  choose  to  let  computation  primitives  be  operations  such  as 
database-read,  database-write,  project,  select,  join,  etc.,  required  to  process  a  transaction. 
Similarly,  in  a  electronic  message  system,  jobs  are  requests  to  send  and  receive  messages.  Macro 
actors  may  be  edit-file,  encrypt-file,  send-file,  write-mailbox,  etc.. 

m.l.  Regular  Actors,  Tokens,  and  Program  Graphs 

Some  of  the  macro  actors  are  regular  actors.  They  are  the  same  as  actors  in  the  dataflow 
model.  (Unless  stated  otherwise,  we  use  the  term  actors  to  mean  macro  actors  in  subsequent 
discussion.)  (Regular  macro)  actors  have  the  following  characteristics: 

(1)  All  actors  of  a  given  type  are  equivalent. 

(2)  Each  type  of  actors  requires  a  fixed  number  of  input  tokens,  each  of  which  must  be  of  a 
specific  type.  When  all  required  tokens  are  available,  the  actor  is  enabled. 

(3)  An  actor  may  execute  only  when  enabled. 

(4)  An  actor  performs  some  computation,  generating  output  tokens  that  depend  only  on  the 
input  tokens. 

(5)  An  actor  need  not  send  the  same  output  tokens  to  all  its  recipients. 

(6)  A  actor  may  have  internal  state  during  the  course  of  a  single  execution  but  no  state 
information  is  preserved  from  one  execution  to  another. 

Similar  to  a  dataflow  graph,  a  macro-dataflow  graph  is  a  high-level  view  of  a  program. 
Nodes  in  this  graph  are  actors.  There  is  an  arc  from  the  actor  t;  to  the  actor  u  when  there  is  data 
dependency  between  v  and  u.  Tokens  flow  along  the  arcs  between  actors  carrying  both  data  and 
control  information.  Parallel  execution  of  a  macro  dataflow  program  can  be  realized  by  firing  each 
actor  as  it  becomes  enabled.  It  is  not  necessary  to  be  concerned  with  synchronizing  the  execution  of 
individual  actors  within  a  program  because  all  allowed  orders  of  execution  of  actors  vis  a  vis  each 
other  are  specified  by  the  arcs  in  the  macro  dataflow  graph. 

m.2.  Persistent  Actors 

The  macro  dataflow  model  described  so  far  is  a  straight  forward  extension  of  the  dataflow 
model.  It  makes  no  provision  for  the  sharing  of  information  between  programs,  or  side  effects  of 
any  program  on  other  programs.  All  information  transfers  occur  through  the  use  of  token  flow. 
Hence,  there  is  no  transfer  of  information  between  dataflow  programs  unless  there  are  arcs 
connecting  actors  in  the  corresponding  dataflow  graphs.  These  connecting  arcs  effectively  make  the 
programs  into  a  larger  program.  In  general,  it  is  difficult  to  model  interprocess  communication 
and  global  serialization  required  by  many  applications  using  this  scheme.  This  scheme  presumes 
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the  knowledge  of  other  programs,  their  relationship  to  each  other,  and  the  order  of  their 
executions.  This  knowledge  is  often  not  possible.  For  example,  suppose  that  the  programs  are 
transactions  to  a  database.  Information  is  transferred  between  transactions  since  they  access  the 
same  database.  It  is  necessary  to  serialize  the  executions  of  transactions  that  read  and  write  the 
same  data  items  in  order  to  maintain  database  consistency.  But  it  is  not  possible  to  construct  a 
composite  dataflow  graph  modeling  interleaved  executions  of  all  possible  transactions  to  the 
database  allowed  by  the  database  manager. 

To  model  communication  between  programs  and  side  effects  of  programs,  we  introduce  the 
notions  of  persistent  data  and  persistent  actors.  Persistent  actors  have  the  same  characteristics  as 
regular  actors  except  (I),  (4),  and  (6)  discussed  above.  A  persistent  actor  maintains  state 
information  that  is  preserved  from  one  firing  to  the  next.  Hence,  the  output  tokens  generated  by  a 
persistent  actor  for  different  firings  are  not  necessarily  the  same  for  the  same  input  tokens.  Since 
each  instance  of  a  particular  persistent  actor  type  can  have  a  different  internal  state,  different 
instances  are  not  identical.  We  note  that  the  notion  of  persistent  actors  is  very  similar  to  the 
concept  of  monitors  in  Concurrent  Pascal  and  objects  in  object-oriented  systems.  By  adding 
persistent  actors  to  the  macro  dataflow  model,  the  arcs  between  actors  in  a  program  graph  no  long 
completely  specify  all  data  dependencies  between  all  granules  of  computation  carried  out  by  the 
system.  In  particular,  persistent  actors  provide  us  with  a  way  to  model  information  transfer 
between  actors  in  different  programs  as  well  as  within  a  program.  We  propose  to  use  the  dataflow 
graph  of  each  program  to  completely  specify  data  dependency  relationship  between  actors  within 
each  program  and  limit  the  use  of  persistent  actors  to  model  synchronization  and  serialization 
between  different  programs. 

To  illustrate  the  use  of  persistent  actors,  we  consider  a  database  system  as  an  example. 
Suppose  that  every  transaction  (i.e.,  a  program)  in  this  system  is  represented  by  a  macro  dataflow 
graph  containing  four  actors:  database-read,  select,  project,  and  database-write.  Suppose  that  in 
the  macro  dataflow  graph,  these  actors  are  connected  as  a  directed  path;  select  and  project  are 
enabled  when  database-read  is  completed  and  database-write  is  enabled  when  select  and  project 
are  both  completed.  This  graph  models  the  dependency  between  operations  within  each 
transaction.  However,  it  does  not  model  the  interference  between  operations  of  different 
transactions.  To  model  this  type  of  interaction  between  programs,  we  let  select  and  project  be 
regular  actors  and  database-read  and  database-write  be  persistent  actors.  To  be  specific,  suppose 
that  a  server  module  called  database  manager  is  provided  by  the  system  to  execute  the  actors 
database-read  and  database-write.  The  database  manager  enforces  some  form  of  concurrency 
control  policy.  As  a  result,  the  value  returned  by  the  database  manager  after  serving  a  database- 
read  request  may  be  data  retrieved  on  the  behalf  of  the  transaction  or  a  “database-read  denied” 
message.  Similarly,  a  database-write  operation  may  be  successively  carried  out  or  failed  depending 
on  the  operations  of  other  transactions.  This  type  of  interactions  can  be  modeled  by  having  the 
internal  state  of  persistent  actors  preserved  from  one  firing  to  another.  The  output  produced  by 
persistent  actors  database-read  and  database-write  depend  not  only  on  their  input  tokens,  but 
also  on  the  sequence  of  read  and  write  operations  already  carried  out  by  the  database  manager, 
that  is,  the  previous  firing  of  these  actors. 

When  a  database- reed  request  is  denied,  subsequent  operations  (i.e.,  select,  project,  and 
database-write  in  our  example)  are  aborted  and  some  appropriate  recovery  action  is  then  taken  in 
typically  database  systems.  One  way  to  model  aborted  operations  is  to  let  each  firing  of  a 
persistent  actor  have  at  least  two  possible  outcomes.  The  first  outcome  is  “success”.  In  this  case, 
the  output  tokens  generated  by  the  persistent  actor  are  sent  to  enable  actors  connected  to  it  in  the 
subgraph  representing  the  program  for  the  transaction.  We  call  the  portion  of  the  computation 
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following  the  persistent  actor  its  success  continuation.  The  second  outcome  is  “failure’7  (e.g.,  due 
to  a  synchronization  error,  or  a  data  out  of  bounds.)  In  this  case,  output  tokens  are  directed  to 
actors  which  perform  whatever  operations  are  necessary,  (e.g.,  rollback,  re-try,  notify  user,  etc..) 
For  example,  these  actors  may  be  in  the  subgraph  representing  the  program  for  rollback  and 
recovery  functions.  We  call  this  the  failure  continuation  of  the  persistent  actor.  In  general,  other 
continuations  are  also  possible.  There  is  no  need  to  specify  what  they  are  here. 

DI.3.  Macro  Data  Flow  Program  Composition 

A  macro  data-flow  program  specifies  the  actors  performing  the  desired  computation 
primitives  and  the  graph  describing  the  dependency  between  the  actors.  To  produce  such  a 
program,  a  programmer  may  take  the  traditional  approach  to  let  a  compiler  determine  the 
parallelism  inherent  in  his  source  code  (such  as  calls  to  predefined  procedures  or  code  segments 
without  data  dependencies)  and  generate  the  actors  and  the  program  graph.  Alternatively, 
programmer  may  use  any  existing  types  of  actors  (i.e.,  program  modules  capable  of  carrying  out 
the  corresponding  computation  primitives,)  or  write  new  types  of  actors  to  provide  functionalities 
not  supported  by  any  existing  type.  The  new  types,  once  written  and  defined,  become  existing 
types  and  can  be  used  later  in  other  macro  dataflow  programs.  The  entire  macro  dataflow  program 
would  be  represented  as  a  hierarchical  graph  constructed  from  actors  and  predefined  subgraphs. 
Thus,  actor  types  are  reusable  program  segments  and  the  macro  dataflow  model  may  be  used  to 
support  software  reuse. 

We  note  that  an  object-oriented  approach  to  software  design  and  implementation  provides 
the  ideal  support  to  the  design  and  development  of  macro  dataflow  programs.  Indeed,  macro  actor 
types  are  functional  objects.  These  objects  can  be  instantiated  to  provide  the  required 
functionalities.  The  arcs  carrying  tokens  in  the  macro  dataflow  graphs  are  also  objects.  These 
objects  are  instantiated  to  provide  a  communication  facility  between  the  functional  objects. 
Objects  can  be  easily  implemented  in  programming  languages  such  as  Ada  which  provide  packages 
to  encapsulate  objects,  hide  implementation  details,  and  allow  parametrized  interfaces. 

In  general,  the  process  of  choosing  the  functionality  and  complexity  of  the  computation 
primitives,  and  hence  actor  types  or  functional  objects,  is  a  software  system  design  process.  The 
ideal  choices  of  computation  primitives  are  application  dependent.  For  applications  where  good 
response  and  high  throughput  are  essential,  the  computation  primitives  should  be  chosen  to 
achieve  near  optimal  trade-off  between  low  communication  overhead  and  a  high  degree  of 
parallelism.  For  applications  where  fault  tolerance  and  availability  are  essential,  computation 
primitives  should  be  chosen  so  that  they  correspond  to  atomic  actions  that  can  be  carried  out 
reliably  and  supported  on  a  highly  available  basis. 

IV.  Macro  Dataflow  Distributed  Architecture 

We  propose  to  support  the  macro  dataflow  model  on  a  distributed  system  in  which  entities 
capable  of  executing  computation  primitives  are  server  modules.  Each  host  may  choose  to  provide 
a  set  of  server  modules  to  support  frequently  required,  well  defined  functionalities  on  the  behalf  of 
user.  Examples  of  server  modules  include  those  which  perfo’  n  sort,  merge,  Ada-compile,  etc. 
when  supplied  with  input  data.  These  modules  have  attribuf  s  of  library  routines;  they  perform 
functions  based  on  inputs  arguments  or  manipulate  data  structures.  Other  examples  of  server 
modules  include  database  managers  and  file  servers.  These  server  modules  have  attributes  of 
processes  and  may  maintain  persistent  data.  In  general,  hosts  may  provide  server  modules  such  as 
terminal  handlers,  message  handlers,  voice  and  video  image  mailers,  image  processors,  printer 
servers,  text  processors,  language  compilers,  communication  servers,  computation  servers,  file 
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servers,  etc..  Jointly,  these  server  modules  support  a  comprehensive  set  of  communication  and 
computation  primitives,  isolating  the  user  from  the  boundaries  of  host  systems  and  networks,  as 
well  as  from  differences  in  operating  systems. 

Even  though  the  macro  dataflow  programs  are  very  similar  to  dataflow  programs  except  for 
the  size  and  complexity  of  computation  primitives,  the  architectural  support  required  by  the 
macro  dataflow  computation  is  very  different  from  that  of  the  dataflow  machines.  In  a  macro 
dataflow  system,  the  execution  of  each  computation  primitive,  corresponding  to  a  client  process  to 
some  server  module,  is  carried  out  whenever  all  the  necessary  input  data  and  control  information 
are  available  to  the  module,  i.e.,  in  parallel  whenever  possible.  The  need  to  have  a  system-wide 
controller  for  scheduling  and  synchronizing  task  execution  purposes  is  eliminated.  Thus,  the  overall 
system  control  structure  can  be  simplified  and  its  robustness  improved.  Server  modules  that  can 
be  instantiated  and  activated  to  execute  computation  primitives  on  each  host  are  computational 
resources  on  the  host.  These  server  modules  play  the  same  role  in  a  macro  dataflow  system  as  the 
roll  played  by  hardware  resources  capable  of  carrying  out  machine  instructions  in  dataflow 
machines.  However,  unlike  hardware  resources,  server  modules  can  be  dynamically  instantiated  on 
any  host,  and  once  instantiated,  can  be  destroyed  or  migrated  to  different  hosts.  For  this  reason, 
resource  management  in  a  macro  dataflow  system  is  more  complex  than  in  dataflow  machines, 
where  hardware  resources  are  often  allocated  to  computations  statically.  When  a  computation 
primitive  in  a  program  is  to  be  carried  out,  the  resource  manager  in  a  macro  dataflow  system  must 
identify  the  type  of  server  modules  capable  of  carrying  out  the  primitive,  locate  or  instantiate  one, 
and  assign  it  to  the  program.  A  distributed  name  server  should  be  used  to  provide  naming  and 
directory  services  to  the  resource  manager,  in  this  case.  (This  name  server  should  not  be  more 
complex  than  the  name  servers  required  to  support  location-transparent,  interprocess 
communication  in  traditional  distributed  systems,  since  much  of  the  same  functionalities  are 
required  in  both  cases.)  To  support  naming  and  directory  services  will  need  precise  functional 
specifications  and  interfaces  specifications  of  server  module  types 

A  concept  that  may  provide  a  general  framework  for  the  specifications  of  server  module  types 
is  the  concept  of  virtual  services  and  virtual  service  protocols.  A  virtual  service  protocol  is  a 
presentation-layer  protocol  compatible  with  the  ISO  Open  System  Interconnection  (OSI)  reference 
model.  It  provides  a  structured  approach  for  characterizing  and  defining  a  (virtual)  service 
provided  by  a  type  of  server  modules,  as  well  as  for  specifying  the  interactions  between  a  server 
module  providing  the  service  and  a  client  using  the  service.  In  other  words,  a  virtual  service 
protocol  defines  a  generic  service  and  presents  a  standardized  interface  to  server  modules  providing 
the  service.  This  interface  can  be  parametrized  using  the  concept  of  option  negotiation  first 
introduced  in  virtual  terminal  protocols.  (Stated  in  terms  used  in  software  specification,  a  virtual 
service  protocol  is  equivalent  to  the  functional  and  interfaces  specifications  of  server  module  type.) 
Virtual  service  protocols  generalize  the  well  known  notions  of  virtual  terminal  protocol,  file 
transfer  protocol,  and  virtual  graphics  protocol,  which  are  used  to  support  terminal  handling,  file 
transfer,  and  graphics,  respectively,  to  provide  a  coherent  set  of  utilities  in  a  distributed  system. 

V.  Summary 

The  macro  dataflow  model  for  distributed  computation  is  presented.  Many  different  models 
of  distributed  and  concurrent  computations  such  as  remote  procedure  calls,  parbegin-end 
constructs,  monitors,  and  objects  can  be  easily  represented  in  this  model.  It  is  patterned  after  the 
dataflow  model,  but  with  two  important  differences.  First,  the  granularity  of  the  computation 
primitives  in  the  macro  dataflow  model  is  arbitrarily  complex.  Using  more  computationally 
complex  actors  reduces  the  communication  ratio  of  macro  dataflow  programs,  and  makes  the 
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underlying  network  less  of  a  bottleneck.  Second,  the  notion  of  the  persistent  actors  is  introduced  to 
model  inter-program  communication  and  side  effects  of  programs.  Persistent  actors  also  provide  a 
convenient  way  to  model  abstract  data  objects,  servers,  and  other  constructs  that  are  awkward  to 
model  in  the  dataflow  model.  Many  questions  remain  open.  One  open  question  is  how  to  model 
fault-tolerant  computation  in  the  macro  dataflow  model.  Another  question  is  whether  secure 
access  control  and  information  flow  control  mechanisms  can  be  more  easily  designed  in  in  a  macro 
dataflow  system. 

This  paper  proposes  a  distributed  system  architecture  based  on  the  macro  dataflow  model  of 
computation.  This  architecture  appears  to  provide  a  framework  within  which  high  degrees  of 
parallelism,  fault  tolerance,  and  availability  can  be  achieved  naturally  in  a  distributed 
environment.  An  important  problem  to  be  solved  concerns  with  the  provision  of  necessary 
underlying  support  to  the  resource  management  function.  Naming,  location,  and  selection  of 
server  modules  to  perform  computation  primitives  invocated  in  programs,  and  the  use  of 
techniques  such  as  atomic  remote  procedure  calls  and  resilient  procedures  to  ensure  fault  tolerance 
and  improve  availability  are  some  of  the  issues  to  be  addressed. 
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ABSTRACT 

There  are  many  heuristic  algorithms  in  the  litera¬ 
ture  for  task  assignment  in  Distributed  Computing 
Systems.  A  common  goal  in  all  of  these  algorithms 
is  to  minimize  the  communication  cost  of  an  as¬ 
signment.  Each  of  the  researchers  who  develop 
heuristic  algorithms  claim  that  their  algorithms 
perform  very  Nell.  The  purpose  of  this  paper  is  to 
compare  the  performance  of  four  algorithms  in  the 
literature  in  terms  of  both  accuracy  and  algorith¬ 
mic  efficiency.  It  is  conluded  that  the  trade-off 
between  optimal  but  slow  algorithms  v.s.  sub- 
optimal  but  efficient  ones  heavily  favors  the 
heuristic  approach.  A  heuristic  algorithm  is  de¬ 
rived  to  combine  the  nos*  successful  parts  of  the 
reviewed  paperst  and  some  remaining  important 
problems  are  identified. 


INTRODUCTION 

He  loosely  define  a  distributed  computing  sys¬ 
tem  as  a  combination  of  two  or  more  computers > 
each  with  private  memory  and  which  works  under  the 
control  of  a  two-layer  operating  system:  The  loner 
layer  consists  of  the  collection  of  local  operat¬ 
ing  systems  that  control  the  local  activities  of 
individual  computers!  and  the  higher  layer  is  a 
network-wide  operating  system  which  coordinates 
the  activities  involved  in  resource  and  load  shar¬ 
ing.  Each  of  these  layers  may  contain  several  sub¬ 
layers  as  suggested  in  the  7-layer  ISO  model  [ll]> 
but  for  the  purposes  of  this  paper  these  sub¬ 
layers  are  transparent.  What  is  important!  how¬ 
ever!  is  thati  the  communication  between  computers 
occurs  over  some  communication  medium  with  a  band¬ 
width  several  times  lower  than  that  between  a 
processor  and  its  local  memory. 

An  important  research  area  in  the  design  of  the 
higher  level  operating  system  is  development  of 
efficient  algorithms  for  load  sharing.  The  impor¬ 
tance  of  load  sharing  can  be  attributed  to  the  ex¬ 
pected  level  of  power  improvement  that  would  be 
attained.  A  measure  of  power  of  any  computer  sys¬ 
tem  can  be  defined  as  the  ratio  T/R,  where  T  is 
the  throughput  and  R  is  the  average  response  time 
of  the  system.  To  maximize  power  u:  need  to  maxi¬ 
mize  throughput  while  minimizing  average  response 
time.  Without  load  sharing!  the  total  throughput 
of  a  system  of  P  processors  will  heavily  depend  on 
the  distribution  of  service  requests  arriving  at 
individual  processors.  In  this  sense,  the  purpose 
of  load  sharing  is  to  re-distribute  service  re¬ 
quests  uniformly  over  the  set  of  available 
processors!  so  that  the  sensitivity  of  the  power 
of  a  system  to  distribution  of  service  request  ar¬ 
rivals  at  different  sites  may  be  minimized.  In 
generali  wa  expect  the  distribution  of  job  ar¬ 


rivals  to  individual  sites  to  be  variable.  There¬ 
fore.  dynamic  algorithms  are  needed  for  real-time 
assignment  of  jobs  to  processors. 

In  the  literature  both  deterministic  and  prob¬ 
abilistic  models  have  been  successfully  used  in 
the  area  of  multiprocessor  scheduling,  where  com¬ 
munication  delays  between  processors  are  negligi¬ 
ble  (either  due  to  shared  memory  or  high  bandwidth 
communication  medium).  However  these  models  can 
not  ba  extended  to  distributed  processing  where 
communication  delays  are  high.  Recently,  load 
balancing  algorithms  were  proposed  for  distributed 
computing  systems  based  on  some  form  of  "bidding" 
or  "drafting"  protocol  [9.10].  but  these  algo¬ 
rithms  are  only  applicable  for  the  assignment  of 
independent  tasks,  or  when  a  task  has  been  parti¬ 
tioned  into  its  constituent  processes  before  allo¬ 
cation  starts. 

Wa  are  particularly  interested  in  the  problem 
of  task  partitioning  which  minimizes  the  communi¬ 
cation  costs  subject  to  constraints  such  as  real¬ 
time.  resource  availability.  or  any  other 
constraints  that  may  be  desirable  for  a  particular 
application.  Such  a  problem  has  been  formulated 
as  an  integer  programming  problem  by  a  nu riser  of 
researchers.  However.  since  the  problem  is 
NP-conplete,  good  heuristic  algorithms  hove  re¬ 
ceived  due  attention  in  the  literature.  The  pui — 
pose  of  this  paper  is  to  compare  the  performance 
of  a  number  of  heuristic  algorithms  proposed  by 
different  authors.  For  this  comparison  we  give  a 
mathematical  programming  model  of  the  problem 
first.  We  then  describe  the  heuristic  algorithms 
proposed  for  the  approximate  solution  of  this 
problem.  Since  no  two  models  are  exactly  the 
same,  a  direct  comparison  is  not  possible.  There¬ 
fore  we  had  to  make  small  adaptations  so  that  a 
uniform  common  ground  may  be  derived,  and  a  single 
set  of  terminology  may  be  used  without  changing 
the  basic  algorithm. 


Problem  Formulation 

He  assume  that  the  processors  in  a  distributed 
system  are  fully  interconnected.  A  task  consists 
of  a  number  of  modules  that  may  communicate  with 
each  other,  and  is  characterized  by  two  matrices: 
(a)  A  cost  matrix  [t(i,j)l  which  indicates  the  ex¬ 
ecution  cost  of  module  i  on  processor  j,  and,  lb) 
A  communication  matrix  [c(i,k>]  which  represents 
the  cost  of  communication  between  modules  i  and  k. 
If  two  modules  i  and  k  do  not  con.r.iuni  cojto  with 
each  other  then  c(i,k)=0,  and  c(i,k)>0  otherwise. 
A  decision  matrix  [x(i.j>]  is  defined  which  re¬ 
presents  an  assignment  such  that  x < i , j  )  =  1  if  mod¬ 
ule  i  is  assigned  to  processor  j,  and  x(i,jl  =  0 
otherwise.  The  total  execution  cost  of  an  assign¬ 
ment  is  expressed  as: 


I  I  tti.j)  x(i,j) 

i  j 

If  the  coaputiri  are  homogeneous  this  term  is 
eliminated  from  the  problem  formulation.  The  total 
communication  cost  of  an  assignment  incurred  be¬ 
tween  modules  not  assigned  to  the  same  processor 
is: 

££  [c(i.k)(l-x(iij))  xlk.jl 

i  j  k>i 

Hence  the  total  cost  of  an  assignment  is  the  sum 
of  coamunicai ion  and  processing  costs: 

COSTaJ  £  ( t(  i .  j  )x(  i ,  j  )  ♦  \  cC  i  .R  It  l-x(  i ,  j  )  >x(k,j  ) ) 

i  j  k>i 

The  optimal  assignment  is  one  which  minimizes 
this  cost  function  subject  to  various  constraints 
that  may  be  specified.  Table  1  shows  the  con¬ 
straints  included  in  the  models  which  we  consider 
in  this  paper.  In  all  of  these  models*  minimiza¬ 
tion  of  the  communication  cost  between  communicat¬ 
ing  processes  is  considered  to  be  the  primary  goal 
of  task  assignment.  All  of  these  models  (but  one) 
assune  that  processors  are  homogeneous*  and  the 
cost  of  communication  depends  solely  on  the  volume 
of  data  (distance  factor  is  ignored).  Other  con¬ 
siderations  between  various  models  include:  dif¬ 
ferent  processor  speeds  [9]*  processors  with 
unique  resources  r2.5]*  real-time  constraints 
(2>4»A]*  and  load  balancing  [{]. 
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communication  cost  of  an  assignment  equals  to  tho 
total  communication  between  these  clusters  since 
we  assume  that  the  communication  cost  between  the 
nodes  within  a  cluster  is  zero.  In  the  rest  of 
this  paper  we  refer  to  each  algorithm  by  the  ini¬ 
tials  of  its.  author. 


Alnorithm  HG 

One  of  the  earlier  algorithms  developed  for 
minimizing  the  communication  cost  of  an  assignment 
is  due  to  K.  Hacssig  and  C.J.  Jenny  [  5  ].  In  this 
model  it  is  assumed  that  some  of  the  nodes*  called 
‘‘distinguished  nodes,**  have  some  unique  resource 
requirements  and  that  there  is  a  one-to-one  corre¬ 
spondence  between  distinguished  nodes  and 
processor  locations.  Therefore  the  assignment  of 
distinguished  nodes  are  fixed.  The  rest  of  nodes 
are  subject  to  assignment  for  minimization  of  com¬ 
munication  costs.  A  partitioning  is  said  to  be 
"feasible”  if  there  is  exactly  one  distinguished 
node  in  each  of  the  clusters  obtained  by  the  as¬ 
signment. 

The  algorithm  consists  of  two  phases:  In  the 
first  phase  a  feasible  solution  is  obtained  which 
is  then  improved  in  the  second  phase.  For  the  de¬ 
termination  of  a  feasible  solution  the  process 
graph  is  first  extended  as  follows:  Assume  in  the 
process  graph  of  figure  l.a  the  nodes  VI  and  V2 
are  distinguished  nodes.  Ue  introduce  an  addi¬ 
tional  node*  V7  in  this  case*  and  connect  this 
node  to  the  distinguished  nodes  VI  and  V2  with  arc 
weights  larger  than  any  other  in  the  process 
graph.  After  this  extension,  we  find  the  maximum 
spanning  tree  of  this  graph  as  shown  in  figure 
l.b.  Now  if  we  drop  the  node  V7  and  its  outgoing 
arcs  we  obtain  a  disconnected  graph  which  contains 
all  of  the  nodes  of  the  original  process  graph. 
This  partitioning  of  nodes  forms  a  feasible  sol¬ 
ution  and  a  first  approximation  to  the  optimum 
cut-sat. 


As  we  see  in  table  1*  no  two  models  are  exactly 
seme.  Therefore  a  direct  comparison  is  not  possi¬ 
ble.  In  the  rest  of  this  paper  we  first  describe 
how  these  algorithms  try  to  minimize  the  communi¬ 
cation  costs  in  an  assignment  and  compare  them  in 
terms  of  running  time  complexity.  Ue  then  describe 
how  various  algorithms  are  extended  to  handle  the 
real-time  constraints*  and  compare  their  rate  of 
success  in  handling  the  real-time  constraint  as 
described  by  each  model.  Ue  close  this  discussion 
by  giving  the  results  of  this  research  and  dis¬ 
cussing  further  problems  for  future  research. 


II  ALGORITHMS  FOR  MINIMIZATION  OF  COMMUNICATION 
COSTS: 

A  task  to  be  partitioned  is  represented  by  an 
undirected  graph  in  which  nodc3  represent  the  pro¬ 
gram  modules  and  arcs  represent  the  communication 
between  them.  Nodes  and  arcs  are  labeled  by  num¬ 
bers  that  represent  the  cost  of  module  execution 
and  cornmunicatibn*  respectively.  Also  character 
labels  are  associated  with  nodes  for  convenience 
in  referring  to  a  node  in  discussions  below.  A 
partitioning  of  the  graph  that  represents  a  task 
designates  the  allocation  of  nodules  represented 
by  the  nodes  in  the  graph,  as  each  subgraph  ob¬ 
tained  after  partitioning  represents  a  cluster  of 
nodes  to  be  assigned  to  the  same  processor.  The 


Figure  1:  Phase  1  of  Algorithm  HG. 

(a)  Process  graph,  (b)  Maximum  spanning 
tree . 
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In  the  second  phase  some  nodes  ere  relocated 
fro*  one  subgraph  to  another  if  it  "pays"  to  do 
so.  To  describe  the  proposed  algorithm  for  the  re¬ 
location  of  nodes,  assume  a  graph  is  partitioned 
into  two  subgraphs  as  shown  in  figure  2.  Here 
CUT-1  represents  the  partitioning  found  at  the  end 
of  the  first  phase  and  A  represents  the  cost  of 
this  partitioning.  Wo  now  take  the  sub-graphs  one 
at  a  time  and  check  if  a  cut  with  lower  communi¬ 
cation  exists  between  the  distinguished  node  and 
another  node  in  that  cluster.  If  a  new  cuti  say 
CUT-1. a  is  found  with  cost  B  such  that  B<A  then 
this  cut  is  preferred  over  CUT-1.  This  process  is 
repeated  for  every  cluster,  reassigning  some  nodes 
■ore  than  once  if  necessary  (e.g.  if  a  cut  is 
found  with  cost  C  such  that  C<B>. 


To  determine  whether  such  a  cut  exists,  the  au¬ 
thors  propose  to  use  a  min-cut  max-flow  algorithm 
between  the  distinguished  node  and  other  nodes  in 
the  same  cluster.  Although  the  authors  propose 
ways  to  minimize  the  number  of  attempts  to  find 
such  reassignments,  in  the  worst  case  we  attempt 
as  many  times  as  there  are  nodes  in  the  cluster. 

The  most  time  consuming  part  of  this  algorithm 
is  the  second  phase  where  a  min-cut  max-flow  algo¬ 
rithm  is  used.  In  the  worst  case,  for  a  graph  with 
V  nodes  P  of  which  are  distinguished,  the  first 
phase  may  assign  just  the  distinguished  nodes  in 
the  first  (p-l)  clusters  and  the  remaining  (V-P+l) 
nodes  to  the  Pth  cluster.  For  an  reassignment  to 
be  made,  we  may  need  to  repeat  the  min-cut  max- 
flow  algorithm  (V-P)  times  on  a  graph  with  (V-P+l) 
nodes.  Since  the  running  time  complexity  of  the 
min-cut  max-flow  algorithm  is  a  third  order  func¬ 
tion  of  the  number  of  nodes  [l]»  for  the  above  al¬ 
gorithm  we  have  a  worst  case  estimation  of 
0(<V-P><V-P+1 )  3),  or  if  P«V,  we  have  0(V4). 

Algorithm  GE 

GYlys  and  Edwards  [4]  proposed  a  much  simpler 
algorithm  than  the  one  above.  The  basic  idea  is 
again  to  form  clusters  of  nodes  in  such  a  way  that 
the  total  inter-cluster  communication  is  minimum. 
Initially  we  start  with  V  clusters  such  that  there 
is  a  cluster  for  each  node  and  that  no  node  is 
contained  in  more  than  one  cluster.  Each  cluster 
is  represented  by  a  node  in  the  process  graph.  For 
e  partitioning  between  P  processors  the  algorithm 
can  be  simply  described  as: 


1.  Repeat  steps  l.a  and  l.b  until  number  of  nodes 
is  reduced  to  P 

l.a  Find  the  cluster  pair  with  maximum  communi¬ 
cation 

l.b  Fuse  them 

The  process  of  fusion  in  step  l.b  involves  mod¬ 
ifying  the  process  graph  so  that  the  two  clusters 
are  now  represented  by  a  single  node  (see 
figure. 3),  the  arcs  from  the  new  node  to  others 
represent  the  combined  communication  of  nodes  in 
the  cluster. 


Figure  3:  (a)  Process  graph,  (b)  aftar 
fusing  C  and  D.  (c)  After  fusing  A  and  B. 


This  algorithm  necessitates  finding  the  arc 
with  minimum  weight  at  each  iteration.  If  there 
are  E  arcs  in  the  process  graph,  0(E)  will  be  the 
search  complexity  by  using  a  linear  search  algo¬ 
rithm.  Since  this  process  will  be  repeated  (V-P) 
times,  the  algorithmic  complexity  will  be 
0((V-P)E)»  or  if  P«V  we  have  0(VE). 


The  authors  also  report  the  results  from  other 
clustering  techniques  based  on  some  measure  of  as¬ 
sociation  between  modules.  TF^y  experimented  with 
many  different  measures  in  the  literature,  and 
found  that  the  success  of  these  algorithms  heavily 
depends  on  the  "right"  choice  of  initial  cluster 
centroids.  Due  to  the  lack  of  good  algorithms  to 
determine  the  "right"  choice  in  a  process  graph, 
we  are  unable  to  pursue  this  option  further  at 
this  stage. 


Algorithm  E 


In  another  algorithm  proposed  by  K.  Efe  [2], 
instead  of  searching  for  the  maximum  of  all  arc 
weights,  we  search  for  a  node  pair  which  communi¬ 
cate  with  each  other  more  than  they  communicate 
with  the  rest  of  the  nodes  in  the  process  graph. 
More  formally,  we  search  for  a  node  pair  k  and  1 
such  that 


c(k,l)  >  c(k,i  );  i  =  l..V,  i=/=l 
c(k,l)  >  c) l,i  ) ;  i=l..V,  i=/=k 


To  find  such  a  pair,  the  algori thm  first  subdi¬ 
vides  the  set  of  all  arcs  S(  E  )  into  V  sets  of 
Slit,  where  S( i  )  is  the  set  of  ores  emanating  from 
noda  i.  The  algorithm  proceeds  as  follows:  (The 
reader  should  cosily  follow  the  steps  of  this  al¬ 
gorithm  by  considering  the  figure-4) 


•V-v* 

■  *  * 
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1.  Pick  a  node  i  at  random 

2.  Find  j  such  that 

c(i,j)=MAX{S(i  )} 

3.  Find  k  such  that 

c(  j,k)=MAX{S(  j)} 

4.  If  i=k  than  axacuta  4.a>  else  execute  4.b. 

4. a  Fuse  i  and  j  and  assign  label  i  to  the  new 
node  obtained  after  this  fusion.  Set  V  <--  V-ls  if 
V=P  then  stop,  else  go  to  step-2. 

4.b  Set  i  < —  jl  j  k>  go  to  step-3. 

The  eore  general  version  of  this  algorithm  de¬ 
scribed  in  2  also  considers  the  notion  of  dis¬ 
tinguished  nodes.  The  modification  required  for 
partitioning  with  distinguished  nodes  in  the  graph 
is  rather  simple:  When  a  pair  of  nodes  arc  found 
that  satisfies  the  inequality  above>  we  check  if 
they  are  both  distinguished.  If  at  least  one  of 
the  nodes  is  a  nen-distinguished  node  then  we  fuse 
them,  otherwise  the  arc  between  them  is  deleted 
and  search  continues  from  the  node  i. 

In  steps  (2)  or  (3)  of  this  algorithm  a  search 
is  made  among  S*  arcs,  where  S»  =  {  S(  i  )  1 

i*l..V}.  In  the  worst  case  we  may  walk  through  V 
nodes •  searching  among  the  2E  arcs  contained  in 
all  sets  of  Still  i=l..V.  Hence  the  total  number 
of  elements  will  be  2E  for  each  fusion,  and 
0<(V-PI2Ei  for  the  entire  algorithm.  Comparing  to 
the  algorithm  6E  this  order  of  complexity  seems  to 
be  slower  by  a  factor  of  two  However,  simulation 
results  show  that  algorithm  E  is  indeed  marginally 
fastar  than  algorithm  6E  because  of  the  way  the 
search  is  made.  Notice  that  the  running  time  com¬ 
plexity  of  algorithm  GE  is  exact,  while  in  algo¬ 
rithm  E  the  search  stops  without  necessarily 
searching  all  possibilities. 


Another  model  that  is  based  on  min-cut  max-flow 
algorithm  is  proposed  by  V.M.  lo  [s].  In  this 
model  processor  speeds  are  assumed  to  be  diffei — 
ent.  A  system  of  V  modules  and  P  processors  is 
modeled  os  a  graph  in  which  each  processor  is  a 
distinguished  node  (see  figure-51.  An  arc  is  drown 
from  each  process  node  i  to  each  distinguished 
node  q  with  the  weight 


Figure  5 

c<  i , q )  e -  J  tti  ,r )  - 
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Where  t(i,r)  is  the  cost  of  executing  module  i  on 
processor  r.  If  the  processors  ore  homogeneous, 
that  is,  if  t(i»r)=t(i)  for  r=l..P,  then  this 
equation  reduces  to  c(  i  ,q)=t< i >/( P-1 ) .  In  a  com¬ 
parison  with  the  algorithms  above  it  is  more  ap¬ 
propriate  to  use  this  reduced  form  of  c(i,q)  to 
make  the  modelling  assumptions  similar. 

As  in  the  algorithm  HJ,  a  feasible  partitioning 
contains  exactly  one  distinguished  node  in  each 
cluster.  If  there  are  only  two  distinguished  nodes 
a  min-cut  max-flow  algorithm  guarantees  the  opti¬ 
mum  result.  For  more  than  two  processors,  the  au¬ 
thor  suggests  a  three  phase  heuristic  algorithm  as 
follows: 

1.  Repeat  phase  1  (below)  until  assignment  is  com¬ 
plete  or  no  assignment  is  made  in  the  last  iter¬ 
ation. 

2.  If  there  are  nodes  remaining  unassigned  then 
try  to  make  e  "lump"  assignment  as  described  below 
<  phase  2 ) . 

3.  If  a  "lump"  assignment  can  not  be  made  then  use 
a  greedy  assignment  algorithm  as  described  in 
phase  3  below. 

Phase  I  To  force  a  cut  between  a  processor  node  k 
and  the  rest  of  the  graph,  let  P  be  the  set  of 
processor  nodes,  and  let  k'=P-{K}. 

1.  For  k=l..P  repeat  steps  l.a  and  l.b. 

l.a  Fuse  all  nodes  in  k’  into  a  single  node 
l.b  Run  min-cut  max-flow  algorithm  between  k  and 
k* 


2.  If  thtn  are  no  nodes  remaining  unassigncd,  or 
if  no  assignment  has  been  made  in  step  1  then  stop 
•Isa  execute  step  3 

3.  Recalculate  the  arc  weight  increments  for 
processor  nodes  (to  be  added  on  the  communications 
of  nodes  already  assigned)  on  the  reduced  graph  of 
reeaining  nodes  and  go  to  step  1. 

Phase  2  In  this  phase  a  lower  bound  L  is  calcu¬ 
lated  on  P-way  assignment  by  running  the  min-cut 
max-flow  algorithm  between  any  arbitrarily  chosen 
pair  of  processor  nodes.  If  L  is  greater  than  the 
cost  of  assigning  all  remaining  nodes  to  one  of 
the  processors  then  all  remaining  nodes  are  as¬ 
signed  to  one  processor.  Otherwise  phase  3  is  ac¬ 
tivated. 

Phase  3  In  this  phase  the  average  communication 
cost  C  between  all  remaining  nodes  is  calculated. 
Then  the  arcs  with  a  weight  less  than  or  equal  to 
C  are  dropped.  After  this  step  we  locate  clusters 
of  nodes  between  which  communication  costs  are 
"large".  Each  cluster  is  then  assigned  to  the 
processor  where  it  will  incur  minimum  communi¬ 
cation  cost. 

The  most  important  determinant  of  the  running 
time  of  this  algorithm  is  the  phase  1,  where  a 
•in-cut  max-flow  algorithm  is  used.  In  a  graph 
with  V  nodes ,  a  partitioning  between  P  processors 
will  require  running  this  algorithm  for  a  graph  of 
V*2  nodes.  Thus,  for  each  iteration  of  the  nin-cut 
max-flow  algorithm  we  have  the  running  time 
0((V+2)3).  In  the  worst  case  the  algorithm  may  as¬ 
sign  exactly  one  node  to  a  processor  at  each  i ter- 
at  ion,  completing  on  optimum  assignment  in  V 
iterations.  The  worst  case  running.time  will  then 
be  0(V(V+2)3),  which  is  roughly  0(V4). 

MODELS  FOR  REAL-TIME  CONSTRAINTS 

Three  of  the  four  algorithms  we  described  above 
provide  mechanisms  for  observing  a  real-time  con¬ 
straint.  Below  we  describe  these  algorithms. 

Algorithm  GE 

The  real-time  constraint  in  algorithm  GE  is  in¬ 
corporated  by  a  small  modification  of  the  one  de¬ 
scribed  above.  Before  the  algorithm  starts  we 
define  a  preset  upper  bound  L  on  the  total  running 
time  of  modules  assigned  to  each  processor.  In  the 
algorithm,  before  a  fusion  is  made  it  is  chocked 
if  a  single  processor  can  handle  the  two  clusters. 
If  the  total  running  time  of  nodes  in  the  two 
clusters  is  less  than  or  equal  to  L  then  they  are 
fused  into  a  single  cluster.  Otherwise  the  arc  be¬ 
tween  these  two  nodes  is  deleted  and  a  new  search 
is'  begun.  The  algorithm  terminates  when  no  more 
fusions  can  be  made. 

Algorithm  E 

In  this  algorithm  we  first  run  the  heuristics 
for  minimization  of  communication  cost  as  the 
first  phase.  If  the  result  docs  not  satisfy  the 
real-time  constraint  then  we  activate  the  second 
phase  which  is  dsscribcd  below. 


l.a  For  any  processor  with  load  level  within  a 
tolerance  range  L+T  delete  the  modules  assigned  to 
that  processor. 

l.b  If  a  processor  has  a  load  less  than  L-T  then 
represent  all  of  the  modules  in  that  processor  by 
a  single  node. 

l.c  If  a  processor  has  a  load  level  that  is 
greater  than  L+T  then  retain  til  of  the  nodes  re¬ 
presenting  modules  assigned  to  it. 

2.  Modify  the  arc  weight  for  the  nodes  in  category 
(l.b)  so  that  arc  weights  are  incremented  propor¬ 
tional  to  the  difference  between  processors  with 
light  load  and  processors  with  heavy  load. 

3.  Using  the  algorithm  E  reassign  some  nodes  to 
processors  with  light  load  until  the  total  proc¬ 
essing  time  in  these  processors  roach  the  range 
L*T. 

This  algorithm  is  persistent  in  the  sense  that 
it  is  designed  to  satisfy  the  real-time  constraint 
at  any  cost.  A  slightly  different  version  reported 
in  2  will  loosen  the  real-tine  constraint  if  the 
cost  is  going  to  be  too  high. 

Algorithm  L 

The  algorithm  L  described  above  does  not  pro¬ 
vide  any  facilities  to  meet  a  real-tina  con¬ 
straint.  Instead,  for  the  real-time  constraint  the 
author  describes  another  algorithm  based  on  a  com¬ 
bination  of  phase  (3)  of  algorithm  L  and  a  graph 
matching  algorithm.  However,  in  this  model  the 
renl-tilr.e  constraint  is  specified  in  ten;.;  of  an 
upper  bound  on  the  rm.-hmr  of  modules  at  a 
processor.  Due  to  this  difference  in  tha  specifi¬ 
cation  of  real-time  constraint  we  are  unable  to 
make  a  direct  comparison.  However,  what  is  worth 
noting  here  is  that,  according  to  tha  author  the 
algorithm  found  the  optimum  result  in  S2.9X  of  the 
tested  cases. 

IV  COMPARISON  OF  ALGORITHMS 

We  have  seen  that  algorithms  HJ  and  L  have  run¬ 
ning  time  complexities  of  roughly  OCv*).  Between 
these  two,  algorithm  L  seems  to  perform  bettor, 
because  the  author  is  able  to  prove  that  if  tho 
assignment  is  ccrplete  at  the  end  of  phase  1  or  2 
then  it  is  optimum.  This  is  a  noteworthy  point 
since  it  imposes  a  certain  degree  of  confidence  in 
the  performance  of  the  algorithm.  On  the  other 
hard,  one  important  fact  remains:  since  the  prob¬ 
lem  being  solved  does  not  exactly  represent  the 
real  world,  it  is  probably  just  as  appropriate  to 
consider  a  less  accurate  solution  if  it  means  a 
substantial  saving  in  the  running  time  complexity. 
The  algorithms  GE  and  E  have  a  lower  order  of  com¬ 
plexity  in  running  time,  and  thus  constitute  rea¬ 
sonable  alternatives  for  algorithm  L.  The 
following  two  theorems  provide  additional  support 
for  this  argument  in  terms  of  accuracy  under  cer¬ 
tain  assumpt i ons . 

Theorem  1  :  If  the  process  graph  is  a  tree  then 
algorithm  GE  minimizes  the  communication  cost. 

Proof :  The  proof  directly  follows  from  the  con¬ 
sideration  of  tho  fact  that  a  tree  becomes  discon¬ 
nected  by  removing  any  of  its  arcs.  For  an  optimum 
P-way  partitioning  wo  just  need  to  remove  those 
P-1  arcs  that  have  the  minimum  weight.  Algorithm 
GE  essentially  produces  this  effect  by  fusing  the 


1.  Calculate  the  overage  load  level  L  for  the 
processors .  Then  modify  the  original  process  graph 
as  follows: 
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nod*  pair  with  maximum  communication  at  each 
iteration. QEO. 

Theorem  2;  If  the  process  graph  is  a  trc*  with  P 
distinguished  nodes  in  it»  then  algorithm  E  finds 
the  minimum  communication  partitioning  that  sepa¬ 
rates  the  distinguished  nodes.  The  proof  is  given 
in  £31. 

As  a  corollary  result  of  theorem-2  it  is  easy 
to  see  that  algorithm  GE  also  finds  the  minimum 
communication  partitioning  for  a  tree  with  distin¬ 
guished  nodes  in  it.  The  significance  of  these 
theorems  becomes  apparent  when  we  consider  the 
fact  that >  to  detect  whether  a  process  graph  is  a 
tree,  all  we  need  is  to  check  if  £=V-1.  In  such  a 
ease  much  simpler  algorithms  can  be  derived  as  im¬ 
plied  by  the  theorem- 1.  Below  we  describe  the  sim¬ 
ulation  results  obtained  from  comparing  algorithms  * 
GE  and  E. 

Simulation 

For  simulation,  we  implemented  algorithms  6E 
and  E  in  Pascal  and  tested  their  performance  using 
randomly  generated  graphs  in  which  the  number  of 
nodes,  V,  varied  between  10  and  100.  The  number 
of  processors  were  between  2  and  20  (or  V/3, 
whichever  is  smaller).  Altogether  we  used  250  ran¬ 
dom  graphs.  Concerning  the  accuracy  of  results, 
without  the  real-time  constraints,  we  observed  no 
significant  difference  between  the  two  algorithms. 
For  76k  of  the  cases  the  results  were  either  opti¬ 
mum  or  within  less  than  20k  difference  of  the  op¬ 
timum.  More  details  of  these  simulations  are  given 
in  [3] . 

The  running  times  are  measured  on  Amdahl  580. 
Table-2  gives  a  typical  running  time  measurement 
for  the  two  algorithms.  Each  entry  in  this  table 
shows  the  running  time  for  a  total  of  25  problems 
with  a  given  number  of  nodes.  Algorithm  E  is  mar¬ 
ginally  faster  than  GE  due  to  the  fact  that  a  node 
obtained  after  fusion  is  very  likely  to  have  an 
arc  with  "large"  weight.  In  such  a  case,  the 
search  time  for  the  next  iteration  is  minimized. 

RUN-TIME  (MILLISECONDS  /  25  PROBLEMS) 


t  OF  NOOES 

Algorithm  GE 

Algori 

10 

152 

144 

20 

483 

397 

30 

995 

880 

40 

1470 

1125 

60 

4763 

3960 

eo 

6402 

5621 

100 

9175 

7475 

Table  Z 


Me  also  compared  the  success  of  the  two  algo¬ 
rithms  in  meeting  a  real-time  constraint.  We  set 
an  upper  bound  L  on  the  completion  of  execution  as 

L  =  L(ave)  ♦  T 

where  L(ave)  is  the  average  running  time  over  P 
processors  and  T  is  the  tolerance  calculated  as 
20k  of  L(ave).  With  this  upper  bound,  on  assign¬ 
ment  is  "successful"  if  all  of  the  modules  are  as¬ 
signed  to  the  given  P  processors.  Algorithm  GE 
performed  much  worsa  than  ue  expected.  For  large 
graphs  ( i .a.  V>20)  it  failed  almost  always.  In  a 
set  of  25  graphs  each  with  10  nodes,  only  7  cases 


produced  a  successful  assignment.  Figure  6  shows 
a  typical  case  when  this  algorithm  fails. 

Algorithm  E  produced  successful  assignments  in 
72X  of  the  tested  cases.  When  it  did  not  succeed, 
it  was  because  the  process  graph  was  disconnected 
after  step-1  of  the  modification  algorithm.  Also 
for  successful  assignments  a  certain  cost  increase 
has  been  observed  due  to  reassignraents.  This  cost 
increase  was  generally  less  than  50k  for  V<40, 
while  up  to  100k  increases  wera  observe.)  for 
larger  problems. 


Figure  6:  (a)  Process  graph,  (b)  After 
four  fusions  failure  occurs  if  P  =  3 
and  L  =  4. 


-V  CONCLUSIONS  AND  FUTURE  DIRECTIONS 

This  research  shows  that  the  trodc-off  between 
Using  optimum  but  slow  algorithms  as  opposed  to 
sub-optimum  but  efficient  ones  heavily  favors  the 
heuristics  when  one  considers  on-line  assignment 
of  tasks  to  distributed  processors.  This  argument 
is  strengthened  partly  due  to  the  success  of 
heuristic  algorithms  in  finding  near  optimal  sol¬ 
utions,  and  partly  by  the  fact  that  an  optimal 
model,  is  nothing  more  than  on  approximation  of  the 
real  world. 

As  we  have  seen  in  this  paper,  different  algo¬ 
rithms  have  their  strengths  and  weaknesses.  Taking 
algorithmic  efficiency  as  the  paramount  goal,  the 
following  strategy  can  be  adapted  in  designing  an 
algorithm  to  combine  the  best  ideas  in  the  papers 
we  considered: 

1.  Use  algorithm  GE  or  E  for  minimizing  communi¬ 
cation  costs 

2.  Use  algorithm  E  for  real-time  constraint 

3.  Use  algorithm  L  if  processor  speeds  are  dif¬ 
ferent. 
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Ncvcrihelftsit  this  algorithm  will  only  provide 
*  sub-epti mal  solution  to  a  rather  small  propor¬ 
tion  of  the  problems  in  task  assignment.  One  im¬ 
portant  problem  that  remains  is  tho  consideration 
of  precedence  constraints  between  modules.  Also 
on-line  calculation  of  the  cost  of  module  exe¬ 
cution  and  communication  are  open  problems,  al¬ 
though  some  limited  success  has  bean  reported 
[6,7]. 
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Abstract 

Is  reusable  software  a  myth,  magic  or  a  messiah?  This  paper  examines  the  issue  of  why  the 
paradigm  of  reusable  software  engineering  has  not  had  the  broad  sweeping  effects  envisioned  by 
the  programming  prophets.  Parenthetically,  the  proverbial  question  "What  makes  reusing 
software  artifacts  difficult?-  is  raised  and  answered  from  the  points  of  view  of  a  programmer, 
software  manager,  computer  scientist,  and  cognitive  psychologist.  Technical,  organizational, 
political,  and  psychological  bottlenecks  are  identified. 


1.  Introduction 

There  it  nothing  new  under  the  tun. 


Ecclesiastes  1:9 


The  concept  cf  rentable  toftware  has  been  part  of  our  programming  heritage  since  the  origins  of 
the  stored  program  computer  EDSAC  at  the  University  of  Cambridge  in  1949.  Maurice  Wilkes1 
first  recognized  the  need  for  avoiding  the  redundant  effort  in  writing  scientific  subroutines  and 
recommended  that  a  library  of  routines  be  kept  for  general  use.  Until  recently,  little  had  been 
done  to  embellish  this  concept  of  program  reusability  over  the  35  years  that  have  passed.  The 
formidable  "software  crisis"  coupled  with  impressive  improvements  in  the  price  to  processing 
power  ratio,  advances  in  programming  language  design,  compiler  construction  and  interactive 
graphics  has  forced  developers  to  reevaluate  the  tradeoffs  made  in  establishing  the  traditional  ad 
hoc  development  methodologies  and  environments  used  in  the  past.  New  and  better  ways  are 
being  explored  to  harness  these  recent  technological  advances  and  to  develop  an  integrated 
software/hardware  system  optimized  for  programmer  productivity.  Again"  and  again3  the  role 
<>{  Reusable  Software  has  been  identified  and  discussed 
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This  paper  examines  the  issue  of  why,  to  date,  the  paradigm  of  reusable  software  engineering  has 
not  had  the  broad  sweeping  effects  envisioned  by  the  programming  prophets.  Parenthetically, 
the  proverbial  question  wWhat  makes  reusing  software  artifacts  difficult?”,  is  raised,  and 
answered  from  the  perspective  of  a 

•  Programmer, 

•  Software  Manager, 

•  Computer  Scientist,  and 

•  Cognitive  Psychologist. 

The  goal  of  this  paper  is  to  identify  the  technical,  organizational,  political,  and  psychological 
impediments  that  have  inhibited  reusable  software  from  being  more  prevalent  in  the  state  of  the 
practice.  Further  discussion  of  these  issues  and  their  solutions,  both  implemented  and  proposed, 
may  be  found  in  a  companion  paper9. 

The  remaining  portion  of  this  paper  is  organized  into  two  sections.  The  first  contains  a 
discussion  of  the  various  viewpoints  regarding  reusability.  The  second  section  contains  the 
summary  and  conclusion. 

2.  Different  Perspectives 

If  you  are  not  part  of  the  solution,  then  you  may  be  part  of  the  problem.  -  Anon. 

”What  makes  reusing  software  artifacts  difficult?”  The  answer  to  this  question  manifests  itself 
in  many  technical,  organizational,  political  and  psychological  issues.  This  section  contains  a 
discussion  of  the  inhibitors  and  facilitators  identified  with  reusable  software  presented  from  four 
points  of  view: 

•  A  Programmer:  someone  who  designs,  implements  and  tests  a  portion  of  a  software 
system. 

•  Software  Manager:  someone  who  manages  a  software  development  project. 

•  Computer  Scientist:  someone  on  the  leading  edge  of  technology,  exploring  and 
developing  new  techniques  for  expanding  the  reusable  software  engineering  paradigm. 

•  Cognitive  Psychologist:  someone  who  understands  the  human  thought  process,  its 
limitations  and  implications  on  programming. 

2.1.  A  Programmer’s  Viewpoint 

What  are  some  of  the  reasons  a  programmer  doesn’t  use  someone  else’s  code  or  design’ 

•  It  is  more  fun  to  write  it  oneself 

•  The  Not  Invented  Here  syndrome  of  making  oneself  indispensable,  or  fear  that  in 
reusing  someone  elses  code,  it  would  be  showing  signs  of  weakness  in  not  being  able  to 
do  it  oneself. 

•  It  is  easier  to  write  it  oneself,  then  to  try  and  find  it,  figure  out  what  it  does,  and  if  it 

I  72 


w. 


works.  Furthermore,  what  are  its  attributes  (performance)  and  dependencies 
(restrictions)?  If  it  has  to  be  modified,  then  it  also  might  be  easier  to  rewrite  it. 

•  There  are  no  tools  to  help  find  components,  or  compose  a  system  from  the  reusable 
pieces. 

•  There  are  no  know  software  development  methodologies  that  stress  reusing  code,  let 
alone  reusing  a  design,  or  a  specification? 

•  Little  emphasis  in  reusing  components  is  taught  in  academia10,  in  fact,  most  students 
don’t  have  any  mechanism  or  motivation  to  save  programs  from  assignment  to 
assignment,  let  alone,  from  course  to  course. 

The  technical  issues  raised  here  focus  on  the  lack  of  well  described,  useful,  and  reliable  reusable 
component  libraries  and  an  integrated  programming  environment  available  to  take  advantage  of 
them.  On  a  more  philosophical  note,  the  reluctance  of  a  programmer  to  re-tool  and  place  a 
dependence  on  someone  else’s  work  generally  inhibits  initial  acceptance  of  this  approach. 

2.2.  A  Manager’s  Viewpoint 

Managers  often  make  decisions  that  are  not  based  on  technical  issues  alone.  Some  reasons  for 
not  adopting  a  Reusable  Software  Engineering  approach  for  a  software  project  might  be  as 
follows: 

•  If  no  tools  or  components  exists,  then  it  will  take  time  and  manpower  to  create  the 
tools  and  the  expertise.  Such  cost  are  generally  not  within  the  budget  of  a  single 
project. 

•  If  the  tools  truly  do  exists  for  making  programmers  more  productive,  then  that  will 
make  the  project  more  dependent  on  fewer  personnel,  consequently  increasing  the 
risk,  and  decreasing  the  number  of  people  on  the  project  (or  reducing  the  empire  a 

O 

manager  commands)  . 

•  Are  the  reusable  tools  deliverable  items?  Does  the  customer  expect  to  need  them  to  do 
maintenance? 

•  How  does  one  set  and  maintain  standards  to  control  what  is  entered  into  the 
components  library11. 

Technical  issues  faced  by  management  are  sometimes  tainted  by  political  considerations,  or 
personal  aspirations.  Nevertheless,  budgeting,  scheduling  and  managing  a  software  project  based 
on  a  reusable  components  library  requires  a  certain  amount  of  confidence  and  experience  in  the 
methodology. 

2.3.  A  Computer  Scientist’s  Viewpoint 

A  computer  scientist  might  take  a  more  far  reaching  perspective  when  facing  the  issue  of 
reusability.  Balzer'  has  stated  that  "Code  is  not  reusable",  suggesting  instead  of  the  black  box, 
plug  compatible  approach  focused  on  programming  product «,  the  answer  to  reusable  software  lies 


in  analyzing  the  programming  proce»».  From  this  perspective,  the  following  alternative 
approaches  to  reusability  have  been  suggested:5 

•  Very  High  Level  Languages  (VHLL’s)  that  allow  specification  of  problem  domain 
entities  and  operations  directly  in  the  syntax  of  the  language.  Also,  Problem 
Oriented  Languages  (POL’s)  are  a  form  of  VHLL’s  that  are  specifically  tailored  for  a 
particular  problem  domain.  Reusability  is  accomplished  by  reusing  the  compiler. 

•  Application  generators  are  software  tools  that  create  programs  given  a  parameterized 
or  programmed  specification.  Reusability  is  again  accomplished  by  reusing  the 
application  generator  for  each  new  problem. 

•  Transformation  Systems  require  high  level  specifications  be  written  describing  what 

the  software  system  should  do.  The  specifications  are  then  transformed  by  a  series  of 

12 

pattern  matching  expansions  into  a  program  . 

The  key  concept  in  each  of  these  three  examples  focuses  on  the  automated  application  of 
reusable  components.  Each  tool  recognizes  some  type  of  high  level  pattern  in  the  problem 
domain  which  can  be  implemented  by  substituting  some  (parameterized)  code  fragments. 
Certain  theoretical  limits  of  the  transformational  techniques  have  been  investigated.  Another 

10  1i 

approach  uses  formal  specifications  for  describing  each  component  ’  . 


2.4.  A  Cognitive  Psychologist’s  Viewpoint 

Computer  programming  is  one  form  of  problem  solving.  Much  insight  has  been  gained  in 
understanding  the  merits  of  several  programming  paradigms  from  the  perspective  of  cognitive 
psychology15.  Reusable  software  has  been  the  focus  of  studies  by  Soloway16  and  Curtis17.  A 
summary  of  the  empirical  evidence  gathered  as  it  applies  to  reusable  software  engineering  is  as 
follows: 

•  The  human  thought  process  is  limited,  by  the  size  of  Short  Term  Memory,  to  the 

number  of  pieces  of  information  it  can  manipulate  consciously  at  one  moment  in  time 
(7+/-2)18.  This  complexity  limit  can  be  overcome  by  proper  chunking  or 

modularization  of  components,  that  is,  collecting  units  of  information  together  into 
one  semantically  meaningful  piece  (or  package)  (an  argument  that  maps  nicely  into 
information  hiding  and  object  oriented  design19). 

•  Expert  (experienced)  programmers  develop  applications  through  a  recursive  mental 
process"0  of  matching  pieces  of  the  problem  with  solution  segments  which  they  are 
familiar  with  (plans  ). 

•  Internal  conceptualization  of  the  knowledge  base  in  which  program/design  segments 
reside  tends  to  evolve  with  experience,  into  having  a  uniform  content  for  all 

OO 

programmers"".  In  other  words,  experienced  programmers  tend  to  think  alike,  and 
express  their  solutions  in  similar  forms. 

•  Programmers  cannot  reuse  something  they  don't  understand.  Furthermore,  expert 
programmers  follow  certain  explicit  rules  of  discourse  regarding  naming 
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conventions  aied  programming  style  which  enhance  program  readability  and 
comprehension. 

These  results  support  the  need  for  a  proper  programming  environment  to  facilitate  the  reusable 
software  engineering  paradigm.  Tools  must  be  available  to  handle  the  complexity,  and  assist  the 
programmer  in  finding  and  understanding  what  software  components  exist.  These  results  also 
demonstrate  the  intuitive  validity  of  such  an  approach. 

3.  Conclusion 

Those  who  cannot  remember  the  past  are  condemned  to  repeat  it  -  George  Santayana 

This  paper  has  described  the  difficulties  that  arise  when  attempting  to  reuse  software  artifacts. 
The  major  issues  may  be  summarized  as: 

•  Most  programmers  tend  to  view  reusability  from  the  perspective  of  simply  reusing 
code  when  reusing  other  programming  artifacts  (designs,  specifications,  and  tests) 
leads  to  a  more  productive  environment.  Furthermore,  other  reusability  paradigms 
(Application  Generators,  Translation  systems,  VHLLs,  and  POLs)  have  proven 
successful5. 

•  Useful,  properly  documented,  tested,  verified  and  classified  reusable  components  need 
to  be  developed  before  they  can  be  reused.24,7 

•  Uxpert  (experienced)  programmers,  with  an  understanding  of  the  problem  domain, 
and  component  library  are  best  suited  for  fully  exploiting  the  reusable  software 

1  fi 

engineering  paradigm. 

•  Tools  and  methodologies  need  to  be  developed  to  support  the  development  and 
cataloging  of  reusable  components  and  the  composition  of  software  systems  from 
them.11 

•  There  are  staffing  risks  associated  with  a  component  based  approach  due  to  increased 

Q 

dependence  on  a  single  individual  to  do  the  work  of  many. 

•  Reusable  software  development  systems  cost  money,  time,  and  manpower  to  develop 
and  become  proficient  at  using. 

The  results  should  in  no  way  be  interpreted  as  being  insurmountable  as  various  successful 
production  quality  systems25,26,27,28  and  prototypes29,12,30,14,  have  been  implemented,  or 

0  4  Ol  O 

proposed.  In  particular  work"  in  Ada"  is  very  promising.  Further  discussion  of  these  systems 
is  not  within  the  scope  of  this  paper  and  may  be  found  in9. 


2 
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Abstract 

One  of  the  best  ways  to  reduce  the  cost  of  producing  software  is  to  reuse  old 
software  instead  of  writing  it  anew.  Although  reusing  small  pieces  of  programs 
has  long  been  practiced  and  software  packages  are  also  reused  in  their  entirety, 
other  products  of  the  software  design  process,  such  as  the  general  structure  of 
a  program  or  the  design  itself,  are  not  formally  reused.  This  paper  discusses 
a  new  approach  to  the  design  and  implementation  of  software  systems.  In  this 
approach,  a  design  is  a  formal  mathematical  object  which  can  be  instantiated 
as  an  object-oriented  implementation.  This  allows  reuse  of  both  designs  and 
implementations. 


1  Introduction 

The  skyrocketing  costs  of  software  production  are  not  helped  by  the  fact  that  new 
software  development  often  seerns  to  require  implementation  “from  the  ground  up” 
despite  the  fact  that  there  are  usually  many  parts  of  other  systems  that  are  similar 
to  (but  not  the  same  as)  portions  of  the  new  system  under  development.  Also  there 
seems  to  be  a  logical  and  philosophical  divergence  between  the  design  of  most  systems 
and  their  implementations;  designs  are  used  to  map  out  a  system  but  frequently  there 
is  little  correlation  between  the  design  and  the  finished  product.  Certainly  there  is 
no  way  to  measure  this  correlation  and  no  way  to  find  out  what  decisions  were  made 
in  going  from  the  design  to  an  implementation. 

We  need  to  gain  a  better  understanding  of  the  design  phase,  and  also  a  better 
understanding  of  the  implementation  phase.  We  propose  two  orthagonal  approaches 
to  this:  first,  a  formal  model  of  design;  and  second  an  incremental  approach  to 
the  implementation  of  designs,  where  a  hierarchy  that  contains  history  information 
concerning  the  program  development  is  encoded.  The  advantage  of  formally  under¬ 
standing  design  is  that  we  reduce  the  chance  of  design  errors  and  the  chance  of  design 
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alterations  that  have  severe  impact  on  software.  The  advantage  of  hierarchy  in  the 
implementation  process  is  that  we  can  determine  the  correlation  between  design  and 
implementation  and  also  identify  the  minimum  work  required  to  reuse  and  retarget 
software. 

The  body  of  this  paper  consists  of  two  sections.  In  the  first  we  discuss  our  notion 
of  formal  support  for  software  design  and  in  the  second  we  discuss  our  hierarchical 
model  of  software  development.  This  model  is  closely  akin  to  the  class  hierarchy  of 
object-oriented  systems  with  class  inheritence. 

2  Formal  Support  for  Software  Design 

Engineering  any  system  has  two  components;  a  creative  component,  and  a  formal 
component.  The  creative  component  is  the  intuition,  insight  and  ideas  that  go  into 
the  design.  The  formal  component  is  the  checking  of  the  design  for  robustness, 
completeness  and  soundness.  These  two  components  are  inextricably  intertwined; 
as  the  designer  is  working  creatively,  so  he  is  bearing  in  mind  the  formal  analyses 
which  are  to  come,  and  trying  to  anticipate  and  prevent  any  flaws  from  entering  the 
design  which  would  be  detected  by  the  formal  analysis.  A  better  understanding  of  the 
formal  aspects  of  software  design,  based  on  discrete  mathematics,  will  allow  a  similar 
approach  to  the  engineering  of  software  systems  -  a  creative  process  backed  up  by 
the  ability  to  construct  formal  models  of  designs  and  reason  about  their  correctness. 

The  major  design  activity  is  decomposing  a  problem  (i.e.  a  specification  for  a 
system)  into  a  set  of  smaller  problems  and  interconnecting  the  solutions  to  these 
smaller  problems  to  form  a  solution  to  the  original  problem.  A  designer  recursively 
decomposes  a  high-level  design  into  specifications  for  smaller  and  smaller  components 
until  the  remaining  components  can  be  easily  implemented  or  can  be  reused  from 
some  other  system.  Thus,  a  design  describes  the  decomposition  of  a  system. 

Since  the  final  implementation  is  described  by  a  programming  language,  it  is 
attractive  to  use  the  same  language  to  describe  the  design,  e.g.  using  Ada1  as  a 
design  language.  However,  this  artificially  makes  designs  language-specific,  uses  a 
notation  that  is  difficult  to  analyze  formally,  and  uses  the  programming  language  for 
purposes  for  which  it  is  usually  ill-suited.  Other,  more  intuitive,  design  techniquesfllj 
lack  a  basis  for  formal  rigor,  as  do  most  of  the  module  interconnection  languages)  10|. 
We  consider  the  components  of  a  software  system  to  be  algebras  (sets  of  functions  on 
specific  data  types),  and  describe  them  with  the  language  of  discrete  mathematics, 
in  the  style  of  (l|. 

Consider  the  system  pictured  in  figure  1.  In  the  algebraic  view,  the  circles  repre¬ 
sent  algebras.  Each  algebra  supplies  a  specific  part  of  the  functionality  of  the  total 
system,  by  supplying  a  set  of  operations  on  a  specific  set  of  data  types.  In  Ada,  these 
algebras  would  be  implemented  as  packages.  The  operations  would  be  procedures 
or  functions  and  the  data  types  will  be  Ada  types  or  more  complex  types  supplied 
by  other  packages.  The  Ada  package  specification  corresponds  to  the  definition  of 
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Figure  1:  Schematic  Representation  of  Component  Interconnection 


the  signature  of  an  algebra,  and  the  package  body  corresponds  to  an  implementation 
of  the  algebra.  Ada  does  not  provide  the  ability  to  describe  the  semantics  of  the 
algebra  as  part  of  the  package  specification.  Languages  such  as  Anna  [7]  would  be 
employed  to  do  this.  The  arrows  between  algebras  represent  potential  communica¬ 
tions  between  the  algebras.  For  example,  the  compiler  could  be  able  to  read  source 
files  from  the  database  and  write  object  files  to  it. 

Having  constructed  this  model  of  a  design,  we  can  ask  questions  of  it,  such  as. 
Are  all  the  software  objects  well  defined?  Is  it  possible  for  the  components  of  the 
system  that  are  supposed  to  communicate  to  do  so?  An  algebra  is  well  defined  we 
can  give  a  meaning  to  any  sentance  constructable  from  the  signature  [1],  The  data 
values  produced  by  one  algebra  can  be  used  by  another  if  there  is  a  function  that 
maps  elements  from  selected  carriers  of  the  first  algebra  to  carriers  of  the  second. 
Thus,  we  can  answer  a  number  of  questions  about  designs  by  translating  them  to 
questions  about  the  equivalent  algebras. 

There  are  some  questions  about  designs  that  can  be  answered  by  considering 
the  general  model  of  design  representation  that  we  have  constructed  without  any 
reference  to  specific  designs.  For  example,  most  software  development  environments 
use  some  universal  language  for  communication  between  tools  in  the  environment, 
such  as  DIANA.  Investigation  into  the  formal  structure  underlying  our  design  model 
has  suggested  that  universal  languages  should  not  be  used  for  this  purpose  as  they 
needlessly  limit  the  flexibility  and  reusability  of  tools  which  rely  on  them.  (DIANA, 
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for  example,  used  an  attribute-grammar  based  paradigm.  Attribute-grammars  have 
not  yet  been  shown  to  be  efficient  enough  for  future  software  development  tools). 
Instead  of  a  universal  language  there  should  be  a  (small)  number  of  communica¬ 
tions  mechanisms  which  allow  communication  between  tools  and  support  conversion 
between  the  types  used  by  the  various  tools  in  the  environment. 

Because  the  design  objects  are  algebras,  they  are  often  themselves  executable(4j. 
This  means  that  we  can  test  designs  by  executing  them.  We  can  also  test  their  correct 
interaction  by  using  systems  that  test  module  interconnection  such  as  Polylith  [8j. 
This  is  obviously  advantageous  to  the  software  engineer,  since  the  costs  of  rectifying 
design  errors  once  a  system  has  been  constructed  are  high. 

Once  we  are  satisfied  with  a  design,  an  implementation  can  be  constructed.  An 
approach  based  on  category  tAeory[6](5]  allows  us  to  reason  formally  about  the  in¬ 
stantiation  of  an  implementation  from  a  design.  The  design  is  viewed  as  a  blueprint 
for  the  implementation.  Coupling  the  design  and  implementation  closely  in  this  way 
has  several  advantages  from  the  software  engineering  viewpoint.  First,  some  aspects 
of  the  implementation  process  can  be  automated,  such  as  automatic  choice  of  control 
structures  by  analysis  of  the  specifications.  Second  we  can  record  implementation 
decisions  as  they  are  made  when  instantiating  the  implementation  from  the  design 
(such  as  choice  of  data  representations). 

We  believe  that  “software  reuse”  and  “design  reuse”  are  inherently  inextricable. 
In  our  model  of  program  development,  design  and  implementation  are  very  closely 
bound.  The  fundamental  focus  for  reuse  should  be  design  reuse.  Each  component  of 
a  design  has  an  implementation  (or  possibly  several  implementations)  closely  bound 
to  it,  together  with  a  history  of  how  those  implementations  were  derived  from  the 
design.  Each  component  may  also  have  a  design,  which  will  be  part  of  the  derivation 
history  of  the  component.  By  reusing  design  components,  one  automatically  gets  the 
implementations.  These  implementations  can  be  reused  if  the  design  is  unaltered, 
or  rapidly  modified  if  the  design  is  changed. 

3  Object-Oriented  Realizations  of  a  Formal  Design 

An  object-oriented  implementation  is  natural  for  our  design  blueprints.  The  circles 
and  arrows  in  the  design  are  objects  in  the  implementation,  where  the  objects  which 
represent  the  circles  provide  the  required  functionality  of  the  system  and  the  objects 
which  represent  the  arrows  are  messages  between  the  functional  objects. 

One  method  of  object-oriented  programming  is  that  used  in  Ada.  Here,  packages 
are  used  to  encapsulate  objects,  providing  information  hiding,  abstraction  and  the 
functionality  of  an  object  in  a  design.  Because  communication  between  objects  is 
limited  to  procedures  described  in  interface  specifications,  clean  interfaces  between 
program  components  are  more  likely  to  arise  than  if  the  program  were  designed  as 
a  set  of  subroutines.  Much  of  the  design  of  a  system  is  expressed  in  the  interface 
specifications  of  its  packages,  since  the  interface  specifications  indicate  which  pack¬ 
age  implements  each  function  that  the  system  is  required  to  implement.  This  is 
why  designs  and  implementations  can  be  tightly  bound;  the  package  specification 


describes  the  signature  of  the  design  object  (i.e.  algebra)  which  it  implements,  while 
the  package  body  describes  the  algebra  itself  by  means  of  a  concrete  implementation. 
This  style  of  programming  is  supported  not  only  by  object-oriented  programming 
languages,  but  by  many  other  modern  programming  languages. 

Many  object-oriented  programming  languages  provide  polymorphic  functions  (pro¬ 
vided  to  a  limited  extent  in  Ada  by  the  overloading  and  generic  facilities  of  that  lan¬ 
guage)  which  greatly  increase  the  potential  for  code  reuse.  Because  objects  in  these 
languages  interact  only  by  s<nding  messages  to  each  other,  objects  depend  only  on 
each  other’s  signatures,  i.e.  'he  set  of  operations  that  they  have  defined.  The  im¬ 
plementation  of  the  operations  for  any  object  is  defined  by  that  object’s  class.  For 
example,  a  printer  object  can  print  any  object  whose  class  has  defined  an  operation 
which  returns  a  printable  representation  of  the  object. 

Polymorphism  is  most  useful  when  there  are  families  of  classes  with  similar  signa¬ 
tures.  Programmers  tend  to  implement  those  operations  necessary  to  let  new  classes 
be  used  with  existing  polymorphic  functions,  resulting  in  these  families  of  classes. 
This  not  only  makes  implementation  easier  but  makes  the  resulting  code  easier  to 
reuse.  Thus,  polymorphism  increases  the  reuse  of  both  code  and  design. 

Many  object-oriented  languages  encourage  the  creation  of  families  of  classes  that 
support  similar  operations  by  allowing  one  class  to  inherit  the  operations  and  data- 
structures  defined  by  another.  Families  of  classes  form  a  hierarchy,  each  with  a  class 
that  defines  the  operations  common  to  the  family.  Often  the  top-most  class  defines 
operations  so  abstractly  that  they  are  unreasonably  inefficient  or  even  not  executable, 
but  the  subclasses  that  make  up  the  family  redefine  operations  and  pick  specialized 
representations  for  data  in  order  to  reach  an  acceptable  level  of  efficiency.  The  class 
inheritance  hierarchy  records  the  choices  in  data  representation  that  were  made  and 
the  algorithmic  improvements  that  resulted.  These  are  important  design  decisions 
that  are  usually  never  explicitly  recorded  in  most  other  programming  styles.  This 
is  vital  for  software  reuse.  If  one  takes  a  design  and  associated  implementation,  and 
makes  a  change  to  the  design,  then  changes  to  key  points  in  the  inheritance  hierarchy 
can  retarget  the  implementation  for  the  new  design. 

The  top-most  class  of  a  family  sometimes  acts  as  a  code  template,  further  en¬ 
couraging  code  reuse.  It  may  leave  a  few  crucial  operations  undefined,  but  define 
a  great  many  more  operations  in  terms  of  the  few  undefined  ones.  Subclasses  will 
define  the  missing  operations  and  so  be  able  to  use  the  many  operations  defined  by 
the  top-most  class.  Thus,  by  defining  a  few  operations  for  a  new  class,  a  program¬ 
mer  can  make  use  of  many  other  operations.  If  a  new  operation  is  defined  for  the 
top-most  class  then  its  subclasses  immediately  inherit  that  operation.  This  contrasts 
with  the  ad-hoc  way  in  which  code  templates  are  usually  implemented. 

Another  way  in  which  code  templates  are  useful  is  to  perform  the  refinements  of 
an  object  (the  creation  of  lower  levels  in  the  hierarchy)  using  an  abstract  programming 
language.  This  means  that  a  development  can  be  retargeted  to  different  programming 
languages  by  instantiating  the  abstract  program  into  different  concrete  programming 
languages  (such  as  Ada,  Modula-2  or  CLU). 

Object-oriented  programming  languages  have  a  reputation  for  encouraging  code 


reuse[9].  This  is  in  part  because  they  encourage  data  abstraction,  partly  because  the 
class  hierarchy  allows  designs  to  be  recorded,  and  partly  because  of  the  programming 
environments  of  which  they  are  usually  a  part.  However,  these  languages  are  usually 
designed  for  the  unconstrained  style  of  artificial  intelligence  programming  rather 
than  the  more  rigid  demands  of  software  engineering. 

Some  work  has  been  done  on  improving  the  object-oriented  languages[3j,  but  it  is 
also  possible  to  add  class  hierarchies  to  languages  designed  for  software  engineering. 
A  database  that  stores  multiple  versions  of  a  program  can  be  used  to  let  a  language 
like  Ada  mimic  the  class  hierarchy  of  an  object-oriented  language.  The  superclass- 
subclass  relationship  can  be  represented  by  the  old  version-new  version  relationship. 
A  subclass  inherites  all  the  operations  of  its  superclass,  but  adds  new  operations  and 
changes  some  of  the  old  operations.  In  the  same  way,  the  new  version  of  a  package 
can  add  new  operations  and  change  old  operations.  This  technique  is  likely  to  use 
multiple  versions  of  a  package  at  the  same  time,  but  few  version  control  systems 
permit  several  versions  of  an  object  to  be  used  simultaneously.  This  suggests  that  a 
specialized  version  control  system  might  be  needed  to  fully  support  class  hierarchies 
in  languages  of  which  they  are  not  normally  a  part. 

4  Future  Research 

Most  research  into  algebraic  abstract  data-types  has  been  primarily  theoretical,  with 
a  few  notable  exections[4]|2].  We  are  investigating  how  that  theory  can  be  used  to 
solve  the  important  design  problems  in  software  engineering,  such  as  those  discussed 
above.  By  gaining  an  insight  into  the  formal  relationships  between  components  or  a 
design  we  get  a  better  idea  of  just  how  to  perform  the  decomposition  process.  By 
understanding  design  decomposition  we  understand  better  how  to  specify  and  use 
modularity,  for  improved  program  development  productivity,  reliability  and  reusabil¬ 
ity  and  retargetability.  All  of  these  are  intricately  coupled  issues. 

Most  work  with  object-oriented  programming  has  been  by  programming  lan¬ 
guage  designers  or  by  those  interested  in  applying  object-oriented  programming  to 
particular  areas  such  as  artificial  intelligence,  office  automation,  or  computer  aided 
design.  We  are  investigating  object-oriented  programming  from  the  software  engi¬ 
neering  viewpoint  and  expect  to  use  it  to  record  design  decisions  and  enhance  code 
reuse. 

In  particular,  we  are  investigating  the  following  problems: 

1.  Which  design  problems  can  be  solved  using  techniques  borrowed  from  discrete 

math? 

2.  Which  of  the  many  methods  for  describing  algebras  form  the  best  basis  for 
specification  of  programs? 

3.  How  does  our  design  technique  influence  the  design  of  a  module  int  erconnect  ion 
language 


4.  How  can  our  design  technique  be  used  with  modern  programming  languages 
developed  for  software  engineering? 

5.  How  can  we  best  use  the  ability  of  a  class  hierarchy  to  record  design  decisions 
and  provide  code  templates? 

6.  What  are  design  principles  for  class  hierarchies  that  maximize  code  reuse? 

7.  How  should  programming  environments  be  changed  to  support  and  take  ad¬ 
vantage  of  object-oriented  programming? 

We  believe  that  our  hybrid  approach  to  these  problems  -  combining  research  into 
formal  and  informal  issues  -  will  allow  us  to  gain  insight  into  solutions  to  pressing 
software  design  and  implementation  issues  and  apply  this  insight  to  the  development 
of  practical  solutions  to  these  problems. 
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ABSTRACT 

The  "software  crisis"  has  made  firms  who  depend  on  hardware  and  software  systems  for 
their  livelihood  (read  "everybody")  keenly  aware  of  the  need  to  improve  programmer 
productivity  and  successfully  capture  the  knowledge  gained  on  previous  software  development 
projects.  Several  government  and  industry  initiatives  (Software  Technology  for  Adaptable, 
Reliable  Systems  (STARS)  and  the  Software  Productivity  Consortium  (SPC),  respectively)  have 
focused  on  software  reusability  as  one  solution  to  this  problem. 

DARTS™,  a  proprietary  software  engineering  environment  developed  by  General  Dynamics' 
Data  Systems  Division,  provides  a  unique  answer  to  the  reuse  question:  archetyping,  the  use  of 
"embedded”  high-order  language  statements  and  an  advanced  macro- processing  capability  to 
automatically  generate  new  versions  of  complex  software  systems.  This  paper  will  provide  an 
overview  of  the  technology  and  archetyping,  including  some  examples  of  archetyped  code.  It  will 
also  discuss  other  knowledge-based  applications  of  DARTS™. 

BACKGROUND 

The  Development  Arts  for  Real-Time  Systems  (DARTS™)  Technology  began  development  in 
mid-1979  as  a  General  Dynamics  Data  Systems  Division  software  initiative  to  try  to  solve  the 
software/programmer  productivity  problem.  DARTS™  is  a  knowledge-based  software 
engineering  technology  conceived  and  developed  to  implement  a  Software  Manufacturing  System 
(SMS).  This  system  allows  domain  experts  (e.g.,  missile  designers,  C3l  system  designers,  etc.) 
to  generate  new  versions  of  complex  software  systems  without  programmer  intervention  using 
specification  languages. 

The  implementation  of  this  software  manufacturing  concept  required  a  new,  very 
high-order  language  incorporating  many  high-technology  software  approaches  (e.g.,  pattern 
matching,  dynamic  knowledge  representation,  list  processing,  set  processing,  embedded 
languages,  dynamic  code  construction,  compilation  and  execution)  into  one  self-defining, 
extensible  language.  The  embodiment  of  these  needs  became  the  Archetype  Xenoclause  Embedding 
(AXE™)  Language. 


THE  CONCEPT  OF  "ARCHETYPING” 

The  basic  premise  of  the  SMS,  shown  in  Figure  1,  is  to  capture  existing,  fully  tested  and 
delivered  software  soon  after  its  completion.  The  same  set  of  software  engineers  who  developed 
the  original  system  then  "archetype"  it.  Archetype  comes  from  the  Latin  archetypum  for 
"first  molded  as  a  pattern,  exemplary."  In  this  case  it  refers  to  a  process  where  the  software 
specialists  work  along  with  users  of  the  system  (and  potential  users  of  similar  systems)  to 
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FIGURE  1  •  DARTS  IMPROVES  PRODUCTIVITY  THROUGH 

CAPTURE  OF  OOMAIN  ANO  PROGRAMMER  KNOWLEDGE 


sketch  out  possible  future  requirements  for  systems  of  this  type.  Thus,  this  team  of  experts 
develops  a  "pattern"  from  which  future  systems  can  be  generated,  hence  the  use  of  the  term 
"archetype." 

Archetyping  is  achieved  through  the  use  of  "xenoclauses"  (xeno-  ,  from  the  Greek  xenos 
for  "stranger"),  foreign  clauses  (or  statements)  "embedded"  within  the  native  code  which 
represent  areas  of  possible  change.  Xenoclauses  can  be  embedded  in  any  native  language.  The  code 
sections  that  replace  the  xenoclauses  to  effect  these  changes  are  stored  in  a  knowledge-base,  a 
very  flexible  storage  system  that  is  an  inherent  part  of  the  DARTS™  Technology.  This  explains 
the  reason  for  the  acronym  "AXE™." 

The  last  step  Is  identifying  these  changes  and  implementing  them  without  programmer 
intervention.  Software  engineers,  working  with  domain  experts,  develop  a  "specification 
language,"  a  set  of  terms  and  parameters  which  have  special  meaning  within  a  particular 
problem  domain.  This  implies  that  a  different  specification  language  must  be  built  for  each 
domain.  This  is  different  from  the  normal  software  engineering  concept  of  a  single  specification 
language  for  all  domains.  The  replacement  code  sections  are  stored  in  the  system 
knowledge-bases  under  access  criteria  referenced  by  the  specification  language. 


As  new  systems  are  required,  users  specify  their  requirements,  which  causes  archetypes 
(and  their  corresponding  knowledge-bases)  to  be  accessed  from  the  SMS  libraries.  These 
library  entries  and  the  user's  specification  form  the  inputs  to  DARTS™.  The  archetyped  files  are 
interpreted  by  the  BOLT™  Processor,  the  interpreter  of  DARTS™,  with  the  embedded  AXE™ 
Language  statements  replaced  by  code  sections  from  the  knowledge  bases. 

The  end  result  of  this  process  is  a  fully  functional  program  ready  for  compilation.  This 
newly  generated  code  meets  all  stated  specifications.  Since  this  new  version  is  based,  for  the 
most  part,  on  a  fully  tested  original,  debugging  time  is  minimized.  If  a  coding  error  recurs 
because  an  archetyped  code  section  is  in  error,  it  can  be  repaired  by  changing  the  knowledge  base 
and  regenerating  the  program. 

To  understand  how  the  technology  works,  it  is  necessary  to  provide  a  brief  outline  of  AXE™ 
language  syntax  and  usage. 

AXE™  LANGUAGE  CONCEPTS 

AXE™  language  statements  are  demarcated  by  a  less-than  ("<")  character  on  the  left  and  a 
greater-than  (“*")  symbol  on  the  right: 

<  {AXE™  statement)  > 

All  AXE™  statement  processing  either  replaces  a  statement  with  some  symbol  or  leaves  no 
symbol  behind  (replaces  a  statement  with  a  "null").  The  BOLT™  Processor  also  ignores  any 
characters  in  the  input  file  that  are  outside  the  boundaries  of  an  AXE™  statement  and  passes  them 
unchanged  to  the  transformed  input.  For  example,  suppose  the  <{AXE™  statement  for  a 
greeting }>  represented  the  word  "HELLO"  and  a  <{Statement  giving  a  person}>  represented 
"READER."  Then  the  BOLT™  Processor  would  convert: 

SAY  <{AXE™statement  for  a  greeting}>TO  THE<{Statement  giving  a  person}> 
into:  SAY  HELLO  TO  THE  READER 

AXE™  statements  can  be  continued  across  multiple  lines  using  the  "<$>"  and  "<$$>" 
statements.  "<$>"  means  "ignore  all  characters  to  the  right  of  me."  "<$$*"  means  "ignore  all 
characters  to  the  left  of  me."  These  are  useful  for  enhancing  readability  and  inserting  comments 
and  other  information  into  AXE™  code.  For  example,  the  sample  statement  used  above  could  also 
be  written: 

SAY  <{AXE™  statement  for  a  greeting}*  TO<$* 

<$$>  THE<{Statement  giving  a  person}* 


SYMBOLIC  REFERENCING 

AXE™  is  an  object-oriented  language,  i.e.,  all  pieces  of  information  can  be  accessed  as 
"metasymbols,"  symbolic  references  to  data,  lists,  sets,  procedures  or  just  about  anything. 
Unlike  variables  or  symbols  in  other  programming  languages,  metasymbols  can  have  more  than 
one  value,  depending  on  the  value  of  user-specified  access  conditions.  These  conditions,  called 
"CONTEXT"  and  "CRITERIA"  expressions,  must  be  satisfied  for  a  particular  metasymbol  value  to 


be  accessed.  The  use  of  these  expressions  is  similar  to  the  concept  of  frames  in  Artificial 
Intelligence  (Al). 

As  a  simple  example,  suppose  that  a  FORTRAN  program  called  for  the  computation  of 
"distance  equals  1/2  times  gravity  times  the  square  of  time": 

1000  DIST  -  0.5  *  GRAVITY  *  TIME**2 

This  statement  can  be  generalized  using  DARTS™.  Assuming  the  contexts  "EARTH"  and  "MOON" 
have  already  been  created,  the  DEFINE  blocks: 

<DEFINE  :GRAVITY>  <IN-CONTEXT  ":EARTH"> 
<START-VALUE>32.2<END-VALUE> 

<END-DEFINE> 

<DEFINE  ,GRAVITY>  <IN-CONTEXT  ":MOON"> 
<START-VALUE>5.6<END-VALUE> 

<END-DEFINE> 

create  a  knowledge-base  of  gravity  values.  The  FORTRAN  statement  is  now  written  as: 

1000  DIST  =  0.5*<{AXE™  syntax  for  "Replace"}  GRAVITY>*TIME  **2 

(Note:  Since  many  languages  include  "<"  and  ">"  in  their  character  set,  translation  map  routines 
have  been  built  using  AXE™.  Source  code  must  be  preprocessed  using  these  routines  to  remove 
the  Vs"  and  ">'s"  and  postprocessed  to  return  them  where  required.) 

Using  an  AXE™  statement  or  a  specification  language,  the  context  "EARTH"  is  established.  The 
BOLT™  Processor  would  then  interpret  the  above  statement  and  result  in  the  FORTRAN  line: 

1000  DIST  =  0.5*32.2*TIME**2 

With  this  technique  software  engineers  can  code  to  specification,  with  variations  on  the  theme 
implemented  using  the  knowledge-base. 

This  simple  example  shows  only  a  small  sample  of  the  power  of  this  technique.  The 
metasymbol  state  may  be  a  complex  boolean  expression  of  CRITERIA,  with  an  upper  limit  of 
4096  different  conditions.  The  " value "  between  <START-VALUE>  and  <END-VALUE>  can  be  an 
AXE™  statement  itself,  resolved  into  a  numerical  constant  prior  to  substitution  into  the 
FORTRAN  expression.  Thus,  in  the  previous  example,  TIME  could  be  changed  to  an  AXE™ 
statement  representing  the  value  obtained  by  a  call  to  a  real-time  clock  or  some  other  routine. 

Metasymbols  can  also  be  entire  code  sections,  replacing  symbol  references  according  to  the 
indicated  context.  This  capability  was  used  during  a  recent  rehost  of  DARTS™  from  the  DEC  VAX 
to  an  IBM  3081 .  The  sections  of  DARTS™  that  were  VAX  architecture-dependent  were  identified 
and  archetyped,  with  replacement  code  sections  written  that  were  targeted  for  the  IBM 
architecture.  This  method  can  be  very  beneficial  when  many  slightly  different  versions  of 
substantially  "the  same"  software  must  be  maintained. 

LIST  EXPANSION 

List-driven  expansion  is  another  powerful  AXE™  technique  useful  for  generating  multiple 


copies  of  similar  character  strings.  AXE™  lists  have  the  following  syntax: 

<L1ST  {Delimiter  definition}  COMPONENT!  {delimiter}...* 


where  {Delimiter  definition}  is  the  first  non-blank  character  after  LIST.  Most  special 
characters  can  be  used  as  delimiters.  Individual  LIST  components  can  be  data  structures  of  any 
kind.  For  example: 


<LIST/1/2/This  is  a  LIST/3/* 


is  a  legitimate  AXE™  list. 

In  list-driven  expansion,  special  AXE™  functions  define  a  "model"  copy  of  some  AXE™ 
program  fragment.  The  model  tells  AXE™  how  many  copies  of  itself  to  repeat.  Each  new  copy  is 
generated  by  inserting  different  text  at  predetermined  points  within  the  model  and  concatenating 
the  resulting  new  copy  to  the  previous  one.  The  final  concatenated  block  of  generated  statements 
replaces  the  model  itself  in  the  program  in  question. 

A  "model"  is  any  section  of  an  AXE™  program  headed  by  a  special  function  called  the 
"expansion"  function  ("<[>").  The  model  terminates  with  a  "model  termination"  function 
("<]>").  Actual  code  might  look  like: 

<[>  I  am  a  base  model  used  for  the  expansions]* 

Each  part  of  the  model  that  changes  from  copy  to  copy  is  built  from  within  the  model  by  an 
"inner  expansion"  function,  which  is  an  "expansion"  function  ("<[>”)  supplying  a  list  of 
strings: 


<{AXE™  execution  information}[{strings}]> 
Continuing  with  the  model  above,  an  expansion  list  could  be: 


m 


V.v 


:v  v 
,a>. 
-/V 

vv 

v'.-v'V 

.Vi: 


<[,  first,  second,third,]> 

The  word  LIST  is  not  necessary  here  because  the  use  of  "<["  implies  that  a  list  follows.  In  this 
case  a  delimits  the  LIST  components.  An  "inner  expansion"  function  placed  within  a  model 
directs  AXE™  to  generate  multiple  copies  of  the  model.  For  example,  the  statement: 


<[>  I  am  a  base  model  used  for  the  <[,first, second, third,]*  expansion.  <]* 
would  result  in 

I  am  a  base  model  used  for  the  first  expansion.  I  am  a  base  model  used  for  the  second  expansion.  I 
am  a  base  model  used  for  the  third  expansion. 

As  an  example  of  this,  consider  the  following  skeletal  Pascal  program  with  embedded  AXE™ 
statements: 


PROGRAM  COLORDEMO(INPUT, OUTPUT) 


CONST  NUMCOLORS  -  3; 

i 

TYPE  NAME  -  CHAR; 

COLORTYPE  -  ARRAY11  ..NUMCOLORS1  OF  NAME; 

i 

VARCOLORNAM :  COLORTYPE; 

i 

PROCEDURE  INIT; 

VAR  I.J  :  INTEGER; 

BEGIN 

<[> 

COLORNAM[<L1...J>] *<EXPAND  \:COLORLIST>';<EOL> 

<]> 

i 

END 


with  the  following  information  stored  in  the  knowledge-base: 

<DEFINE  COLORUST> 

<IN-CONTEXT  ":PRIMARY*> 

<START-VALUExUST|RED|YELLOW|BLUE|><EXIT-VALUE> 

<END-DEFINE> 

The  ellipsis  ("...")  is  a  default  in  AXE™  that  tells  the  BOLT™  Processor  to  expand  the  list, 
starting  at  "1"  in  this  case,  to  the  length  provided  by  the  next  list,  COLORLIST  here.  (Note  that 
the  was  selected  as  the  list  delimiter  for  this  example.)  <EXPAND  \:COLORLIST>  means  just 
that:  expand  COLORLIST  and  replace  this  expression  as  many  times  as  is  indicated  by  the  list 
length.  In  this  case  3  copies  are  produced.  Thus,  with  the  context  set  to  "PRIMARY”,  processing 
this  list  expansion  results  in  the  following  Pascal  code: 

: 

COLORNAM[1]  :« 'RED1; 

COLORNAM[2]  :=  'YELLOW; 

COLORNAM[3]  :« 'BLUE'; 

! 

List  expansion  can  be  combined  with  symbolic  referencing  to  further  generalize  the 
archetype.  Instead  of  simply  declaring  its  value  as  a  constant  at  the  start  of  the  program, 
NUMCOLORS  can  be  determined  by  obtaining  the  dimension  (or  number  of  elements)  of 
COLORLIST  in  the  proper  context,  i.e., 

CONST  NUMCOLORS  =  <_;DIMENSION.  \:COLORLIST>;. 

DIMENSION  is  just  one  of  the  many  useful  metasymbol  attributes  readily  available  to  AXErM 

programmers. 


Other  values  of  COLORLIST  can  be  added  to  the  knowledge-base  under  different  contexts,  e.g., 


<DEFINE  COLjORIJST> 

<IN-CONTEXT  ":SECONDARY"> 

<START-VALUExUST|ORANGE|GREEN|PURPLE|x$> 

<$$xEXIT-VALUE> 

<END-DEFINE> 

<DEFINE  COLORUST> 

<IN-CONTEXT  ":ALL"> 

<START-VALUExLIST|BLACK|xEXIT-VALUE> 

<END-DEFINE> 

<DEFINE  COLORUST> 

<IN-CONTEXT  "NONE"* 

<START-VALUExLIST|WHITE|xEXIT-VALUE> 

<END-DEFINE> 


Now,  expansion  under  the  context  ’PRIMARY*  would  result  in  the  same  Pascal  code  as  above.  If 
the  oontext  was  changed  to  "SECONDARY",  however,  the  expanded  code  would  be: 

i 

COLORNAM[1]  >  'ORANGE; 

COLORNAM[2]  >  'GREEN'; 

COLORNAM[3J  >  'PURPLE; 

j 


Finally,  setting  the  contexts  "ALL"  or  "NONE"  would  result  in: 

COLORNAMfl]  :-'BLACK’;  and  COlORNAM[1]  VWHITE’; 


respectively. 

As  stated  previously,  these  techniques  can  be  applied  to  any  source  language.  AXE™ 
statements  can  be  nested,  with  multiple  CONTEXT'S  and  CRITERIA'S,  resulting  in  an  almost 
limitless  number  of  archetyping  combinations. 

OTHER  DARTS™  APPLICATIONS 

The  DARTS™  Technology  has  also  been  successfully  used  in  a  number  of  other  application 
areas: 

•  Language  Translators  -  the  pattern-matching  and  knowledge  base  capabilities  of  DARTS™, 
originally  included  to  allow  the  parsing  and  comprehension  of  specification  languages,  make 
DARTS™  the  ideal  vehicle  for  language  translation.  This  has  been  the  most  successful  application 
area  to  date.  Translators  providing  over  80%  conversion  rates  have  been  brought  up  in  periods 
ranging  from  2  man-weeks  to  2  man-months,  depending  on  the  languages  involved.  DARTS™ 
engineers  work  closely  with  customers  to  determine  the  most  cost  effective  translation 


percentage,  identifying  the  level  and  nature  of  the  manual  effort  required  to  complete  the 
conversion.  Documentation,  at  least  in  a  rudimentary  form,  can  also  be  produced  as  a  by-product 
of  the  translation  process. 

The  list  of  completed  translators  includes  CDC  Assembly  Language  to  Cray  Assembly 
Language,  Franr  Lisp  to  Common  Lisp,  COBOL  to  FORTRAN,  and  Pascal  to  C.  DARTS™  has  been 
particularly  useful  in  converting  fourth  generation  languages  good  for  prototyping  (e.g.,  Mark 
IV,  Model  204  User  Language)  into  production  quality  systems  (COBOL  for  these  examples). 

*  Knowledge-Based  System  Development  -  The  list/set  processing  and  knowledge-base 
capabilities  of  DARTS™  also  make  it  a  useful  tool  for  expert  system  development.  However, 
DARTS™  is  flfll  an  expert  systems  shell,  like  ART  or  KEE.  Building  an  expert  system  using 
DARTS™  is  just  like  building  one  from  scratch  using  Lisp.  Both  require  the  same  types  of 
resources  and  development  efforts.  Knowledge  engineers  are  a  vital  commodity.  An  inference 
engine  must  be  constructed.  The  user  interface  must  be  defined  and  built. 

On  the  other  hand,  DARTS™  does  provide  some  advantages.  Knowledge-bases  and  knowledge 
representation  methods  are  built  into  the  technology.  While  custom  inference  engines  are 
necessary  for  each  DARTS™-based  expert  system,  experience  on  Lisp-based  systems  has  shown 
that  inferencing  is  most  effective  when  tied  directly  to  the  knowledge-base  structure.  DARTS™ 
also  provides  a  screen  definition  and  processing  capability  that  greatly  simplifies  user  interface 
development. 

A  number  of  expert  systems  have  been  completed  for  General  Dynamics  internal  projects, 
including  a  Design  Analysis  Expert  System.  This  system  was  designed  to  analyze  engineering 
drawing  notes  for  completeness  and  internal  consistency  with  required  company  and  government 
specifications.  The  menu-driven  system  highlights  errors  and  provides  specification  numbers 
and  suggested  corrections  for  each  entry. 

PRODUCT  INFORMATION 

DARTS™  is  currently  available  under  license  from  General  Dynamics.  DARTS™  includes  the 
BOLT™  Processor,  the  AXE™  Language  (AXEL)  knowledge-base,  and  system  documentation.  The 
BOLT™  Processor  consists  of  approximately  50,000  lines  of  PASCAL,  FORTRAN  and  assembly 
language  code  and  executes  in  a  8  MB  virtual  image  on  Digital  Equipment  Corporation  VAX  series 
computers  under  VMS.  The  IBM  version,  written  all  in  Pascal,  runs  on  the  308X  series  under 
MVS/XA.  BOLT™  Jr.,  an  OEM  BOLT™  Processor  ,  is  an  execution-only  version  to  support 
secondary  sales  of  programs/systems  written  in  the  AXE™  language. 

SUMMARY 


DARTS™  provides  a  knowledge-based  solution  to  software  reuse.  The  examples  and  AXE™ 
language  information  within  were  intended  to  provide  insight  into  the  power  of  archetyping, 
without  getting  bogged  down  in  technical  details.  If  this  message  came  through,  it  is  hoped  that 
DARTS™,  or  at  least  these  concepts,  can  form  the  basis  for  an  innovative  solution  to  the 
"software  crisis." 
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Over  the  last  three  years  there  has  been  a  resurgence  of  interest  in  the  design  of 
large  parallel  computer  systems.  Several  new  classes  of  highly  parallel  computers  have 
recently  been  introduced.  These  include: 

(1)  massively  parallel  hypercubes,  such  as  the  64,000  1-bit  computer  SIMD  network 
Connection  Machine  1; 

(2)  large  multiprocessors  with  a  few  hundred  processors  sharing  a  global  memory  via  a 
log(N)-stage  interconnection  network,  such  as  the  IBM  RP32  and  University  of  Illi¬ 
nois  Cedar3,  each  with  about  512  processors; 

(3)  large  multiprocessor/multicomputers  with  clusters  of  a  few  hundred  computers  shar¬ 
ing  a  common  memory  and  linked  to  other  clusters,  such  as  Cedar3  and  the  planned 
Vitesse  VNP4’  5  tree  network  of  up  to  1089  computers,  linked  within  each  cluster  by 
a  crosspoint  switch  to  shared  memory  and  between  clusters  as  a  tree  of  fanout  8  to 
16. 

(4)  small  multiprocessors  with  a  few  dozen  processors  sharing  memory  via  a  fast  global 
bus,  such  as  the  Encore  Multimax6  and  Alliant  FX/87; 

(5)  large  multicomputers  of  a  few  hundred  computing  nodes  linked  by  hypercube  inter¬ 
connections,  such  as  Ncube  and  the  Intel/Caltech  Cosmic  Cube8; 

(6)  massive  SIMD  array  processors,  such  as  the  11  Gflop  IBM  GFll9,  built  specifically 
for  physics  quark  modelling;  and 

(7)  one  to  sixteen  processor,  shared  memory  versions  of  the  ultrafast  vector  processors 
that  are  the  current  commercial  supercomputers,  such  as  Cray  X/MP-410.  Cray?11, 
Fujitsu12,  Hitachi12,  NEC5,  CDC  Cyber-Plus5,  and  the  planned  ETA-101*'. 

1.  Operating  Systems  for  New  Parallel  Machines 

The  operating  systems  of  all  these  machines  differ  significantly.  SIMD  machines 
have  only  one  instruction  stream  common  to  all  processors.  Resource  management  is  a 
significant  part  of  the  algorithm  programmer’s  task.  Each  algorithm  must  deal  directly 
with  allocation  of  data  to  memories  and  the  movement  of  data  to  the  processors  when 
needed.  The  operating  system  for  an  SIMD  machine  usually  runs  on  the  general  purpose 
host  to  which  it  is  attached.  It  aids  development  of  new  algorithms,  loads  new  jobs,  and 
accepts  the  torrent  of  data  produced  during  execution. 


The  new  shared  memory  multiprocessor  systems  are  a  significant  interim  step  in 
solving  the  programming  problems  posed  by  massively  parallel  multicomputers  with 
separate  memories  for  each  processor.  By  having  special  hardware  support  for  access  to 
globally  shared  memory,  they  allow  concurrent  programs  to  have  many  processors  pass¬ 
ing  common  data  rapidly  via  the  shared  memory.  Programmers  need  worry  only  about 
how  to  subdivide  their  algorithms  into  separate  processes.  They  need  not  worry  about 
delays  in  exchanging  data  among  processes.  Communication  delays  are  a  major  addi¬ 
tional  concern  for  distributed  programs  running  on  multicomputers  without  shared 
memory. 

Multiprocessor  programmers  still  must  worry  about  decomposition  of  problems  into 
processes,  synchronization  between  processes,  and  partial  failures  by  individual  proces¬ 
sors.  Multiprocessors  are  only  an  interim  step  because  rapid,  frequent,  simultaneous 
access  to  a  globally  shared  shared  random  access  memory  by  millions  of  processors  is 
almost  certainly  an  impossible  task.  With  more  than  ten  processors  sharing  memory, 
there  are  severe  problems  with  memory  bandwidth.  More  insidiously,  when  there  are 
many  processors,  most  may  be  waiting  on  locks  to  access  shared  data  that  another  pro¬ 
cessor  is  changing. 

The  small  multiprocessors  (Encore,  Alliant)  use  high-speed  global  busses  and 
hardware  support  of  synchronization  and  locking  operations  such  as  test-and-set  for 
shared  memory  access  by  up  to  20  processors.  Further  expansion  of  these  systems  will  be 
very  difficult,  especially  because  of  locking  delays.  The  large  multiprocessors  (RP3, 
Cedar,  Vitesse)  all  have  permanent  memory  and  recent  access  caches  local  to  each  pro¬ 
cessor  to  greatly  reduce  reading  of  shared  memory.  In  many  ways,  they  are  multicom¬ 
puters.  They  also  have  efficient  interconnection  networks  to  allow  all  processors  to  access 
different  parts  of  shared  memory  simultaneously  and  very  special  hardware  to  minimize 
blocking  during  writes  to  common  memory  locations. 

In  particular,  both  the  IBM  RP3  and  the  Cedar  multiprocessors  use  Iog(N)-stage 
Omega14  networks  for  N  simultaneous  processor-to-memory  accesses.  As  justified  by 
simulations  for  the  NYU  Ultracomputer15,  several  million  dollars  of  the  costs  ($20M)  of 
the  512  processor,  800  Mflop  RP3  is  special  network  logic  that  combines  simultaneous 
fetch-and-add  (F&A)  operations  to  the  same  memory  location.  The  combining  logic 
returns  a  consistent  set  of  fetched  values  to  the  various  processors  and  adds  the  sum  of 
their  increments  to  the  memory  location.  The  effect  is  exactly  the  same  as  if  the  simul¬ 
taneous  operations  executed  one  after  each  other  in  some  unspecified  order.  In  particu¬ 
lar,  this  F&A  operation  can  be  used  to  manage  queues  for  operating  systems  such  as 
Unix  so  that  many  processors  can  simultaneously  add  and  delete  entries  without  locks. 

Almost  all  of  the  shared  memory  multiprocessors,  including  the  Cray2  and  Alliant 
vector  processing  systems,  support  the  popular  Unix  operating  system  because  it  offers  a 
flexible  interface  to  users.  With  shared  memory,  it  is  easy  to  support  parallel  execution 
at  the  coarse-grained  Unix  process  level.  If  there  are  enough  processes,  all  processor  can 
work  simultaneously,  except  during  synchronization  waits.  The  vector  processors  also 
offer  tools  for  extracting  fine-grained  concurrency  from  manipulations  of  matrices  and 
vectors.  For  example,  different  Cray  processors  may  be  simultaneously  executing  the 
same  Fortran  DO-loop  for  different  ranges  of  index  values. 
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2.  Operating  Systems  for  Future  Massively  Parallel  Machines 

The  current  hypercube  multicomputers  (Cosmic  Cube  and  Ncube)  and  the  planned 
MIMD  version  of  the  Connection  Machine  are  precursors  of  future  massively  parallel 
multicomputers  ("megacomputers").  Frequent  access  to  globally  shared  memory  will  be 
impossible  in  megacomputers.  In  the  near-future,  designs  like  the  Vitesse  system,  which 
is  a  multicomputer  tree  of  multiprocessors,  are  very  likely  to  be  the  most  efficient  way  to 
build  machines  with  many  thousands  of  processors.  Clusters  of  processors  can  access  a 
hierarchy  of  shared  memories,  with  the  most  distant,  globally  shared  memories  accessed 
very  infrequently. 

There  appear  to  be  at  least  three  avenues  of  study  for  algorithms  to  manage 
resources  and  to  perform  application  calculations  on  megacomputers,  where  system  size 
will  lead  to  delays  and  other  inaccuracies  in  data  from  distant  computing  elements: 

(1)  distributed  algorithms,  in  which  separate  processors  use  mainly  local  data  aug¬ 
mented  by  infrequent  accesses  ("messages")  to  data  in  memories  of  distant  proces¬ 
sors; 

(2)  probablistic  algorithms16,  in  which  good  approximations  to  exact  results  can  be  cal¬ 
culated  rapidly  from  volumes  of  input  data,  even  if  a  few  items  are  incorrect. 

(3)  asynchronous  algorithms,  in  which  correct  results  do  not  depend  on  locked  step 
operations  by  different  processors,  avoiding  wasteful  waits  to  achieve  synchrony. 

Work  at  SUNY/Stony  Brook  on  control  algorithms  for  megacomputers  has  already 
started  exploring  two  of  these  avenues,  tree-based  distributed  control  algorithms17'19  for 
networks  of  computers  and  asynchronous  communication  and  data  replication 
mechanisms20"23  for  networks.  Even  computers  built  from  processors  clustered  around 
locally  shared  memories  have  data  sharing  restrictions  consistent  with  a  model  of  (mul¬ 
tiprocessor)  computers  embedded  in  a  communication  network. 

2.1.1.  Dynamic  Group  Organization  and  Use  for  Distributed  Systems 

An  extensive  study  has  been  made  of  ways  to  implement  and  use  multicast  com¬ 
munication  within  groups  of  components  in  large  netcomputers,  especially  ones  linked  by 
grids  of  ethernets.  Techniques  for  organizing  dynamic  groups  have  been  studied  for  use 
in  supporting  multicast  communication,  in  maintaining  consistent  replicated  data,  and 
in  providing  network  services.  Early  results  have  been  published  in  three  papers20-22  and 
a  dissertation23. 

2.1.2.  Tree-based  Distributed  Control  Algorithms 

Work  has  begun  in  exploring  the  efficiency  of  several  newly  created  distributed  algo¬ 
rithms  for  resource  management  in  large  networks.  Most  of  these  algorithms  use 
hierarchical  patterns  of  communication  to  allow  efficient  centralized  control  of  high-level 
performance.  However,  they  are  being  analyzed  both  in  terms  of  tree-branch  distances 
and  in  terms  of  physical  distances  for  the  common  cases  where  they  are  imbedded  in  reg¬ 
ular  physical  topologies  such  as  2-  or  3-dimensional  nearest-neighbor  meshes. 

Because  most  of  their  data  paths  traverse  an  O(logiV)  number  of  tree  branches, 
tree-based  control  algorithms  seem  suitable  to  a  large  number  of  different  underlying  net¬ 
work  topologies.  In  other  words,  they  hold  promise  of  being  nearly  universally  applicable 
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to  a  large  class  of  network  computers.  Of  course,  raw  state  data  from  the  lower  levels  of 
the  tree  must  be  compressed  or  filtered  at  each  higher  management  level  to  avoid  over¬ 
loading  nodes  near  the  root. 

The  one  completed  study19  evaluates  an  algorithm  for  very  locally  reconfiguring  a 
spanning  tree  of  processors  assigned  network  management  duties  after  a  failure  has 
incapacitated  one  of  them.  The  analysis  techniques  are  able  to  give  closed  form  expres¬ 
sions  for  the  expected  values  of  several  measures  of  the  algorithm,  including: 

(1)  the  maximum  extent  of  network  disruption  during  the  recovery  after  one  failure, 

(2)  the  increase  in  total  spanning  tree  branch  lengths  per  local  reconfiguration, 

(3)  total  message  traffic  generated  to  reconfigure  after  one  failure,  and 

(4)  the  total  delay  until  the  control  tree  is  completely  recovered  from  each  failure. 

There  are  a  number  of  other  distributed  control  algorithms  for  which  some  initial 
results  are  known.  Each  is  based  upon  spanning  trees  of  control  processors.  Each  drasti¬ 
cally  restricts  the  flow  of  management  data  up  the  tree  to  avoid  traffic  bottlenecks  near 
the  root.  They  include: 

(1)  two  other  algorithms  (  Promotion  and  Chaining  )  for  local  repair  of  spanning 
trees  of  processors  used  to  control  massive  networks, 

(2)  algorithms  for  finding  near-optimal  routes  in  large  networks  without  using  either 
global  models  of  the  topology  or  global  broadcasts  to  find  the  shortest  path  from  a 
source  to  a  new  destination, 

(3)  algorithms  for  managing  limited-size  distributed  caches  mapping  names  to  network 
identifiers  for  recently  accessed  resources  so  that  requests  for  specific  resources  can 
usually  be  satisfied  without  expensive  global  searches, 

(4)  algorithms  for  finding  a  nearby  instance  of  a  large  pool  of  identical  resources,  such 
as  an  idle  processor  that  can  immediately  execute  a  single  task,  and 

(5)  an  algorithm  for  finding  a  physically  close  group  of  idle  processors  for  co-scheduling 
of  several  tasks  that  must  be  executed  simultaneously  since  they  interact  closely. 

2.1.3.  A  Comparison  of  Two  Failure  Recovery  Algorithms  for  Trees 

As  a  sample  of  the  results  that  can  be  obtained  from  these  analyses  of  specific  algo¬ 
rithms,  here  is  a  summary  of  the  characteristics  of  two  algorithms.  Each  locally  replaces 
a  failed  member  of  a  tree  of  computers.  The  computers  are  embedded  in  a  2-dimensional 
sheet  and  each  has  connections  to  a  few  neighbors.  The  first  algorithm 
(  Promotion  )  locates  a  leaf  node  physically  near  the  failed  node  and  links  it  into  the 
tree  as  an  in-place  substitute.  The  second  algorithm  (  Chaining  )  substitutes  one  of  the 
tree-children  of  the  failed  node  for  its  missing  parent,  then  recursively  substitutes  at  each 
lower  level  until  a  descendant  which  is  a  leaf  and  has  no  children  is  substituted  for  its 
immediate  parent. 

For  both  these  algorithms,  it  is  assumed  that  there  are:  /  immediate  children  under 
each  management  node  in  the  tree;  h  levels  of  nodes  in  each  tree  with  the  root  at  level  0: 
an  expected  standard  physical  distance  d  from  each  lowest  level  h— 2  manager  to  its  leaf 
children  at  level  h— 1;  a  time  unit  t  needed  for  a  message  to  travel  a  physical  distance  d\ 


and  a  packing  ratio  r  for  the  expected  value  of  grandparent-to-parent 

versus  parent-to-child  physical  distances  at  each  level.  For  the  purposes  of  this  analysis, 

groups  are  assumed  to  be  clustered  in  a  plane,  yielding  r  =/. 

The  general  mode  of  analysis  for  each  of  the  evaluated  attributes  of  the  algorithms 
is  to  determine  the  effect  of  a  failure  at  an  arbitrary  level  k ,  then  multiply  by  the 
number  of  nodes  at  level  k  (/  ),  and  sum  over  all  levels  (O^k^h-1).  Dividing  by  the 

fh- 1 

total  number  of  nodes  ( - )  determines  the  expected  value  for  failures  uniformly  dis- 

/- 1 

tributed  thoughout  the  network.  For  realistic  tree  fanout  values  near  /  =  9,  each  value 
is  dominated  by  the  results  of  leaf  failures.  The  measures  of  each  algorithm  include: 

(1)  Impact,  the  number  of  nodes  directly  affected  by  reconfiguration  after  one  failure; 

(2)  Growth,  the  increase  in  total  tree  branch  physical  lengths  per  reconfiguration; 

(3)  Traffic,  the  product  of  the  total  number  of  messages  times  the  physical  distance 
each  travels  for  all  messages  generated  for  one  reconfiguration;  and 

(4)  Delay,  the  time  for  all  reconfiguration  messages  to  travel  with  the  maximum  paral¬ 
lelism  allowed,  during  the  recovery  from  each  failure. 

For  d  the  short  internode  physical  distance  at  leaf  level,  for  t  the  unit  time  to  send 
a  message  a  distance  d  in  the  underlying  physical  network,  and  ignoring  terms  that  are 
insignificant  for  large  values  of  N~f  -1,  the  simplified  vaiues  are: 

Algorithm  Comparison  Equations 


Measure 

Promotion 

Chaining 

2 

1 

Impact 

«  2+ — 

=  2+ - 

f2 

/-I 

Assuming  that  fanout  /=  9,  the  values  of  the  simplified  expressions  are: 


Algorithm  Comparison  Values 

(for  Large  Networks) 


Measure 

Promotion 

Chaining 

Impact 

2.03 

2.13 

Growth 

0.05d 

0.40<f 

Traffic 

6.8  d 

8.2  d 

Delay 

1.06f 

0.90t 

The  very  simple  Promotion  algorithm  is  better  both  in  terms  of  lessening  expected 
branch  growth  and  in  terms  of  the  total  message  traffic  impact  on  the  communications 
system.  If  all  computing  nodes  in  the  network  have  equivalent  computing  and  communi¬ 
cation  capabilities,  the  Promotion  algorithm  is  preferred.  If  the  network  is  non- 
homogeneous  because  nodes  at  the  higher  levels  generally  have  better  access  to  commun¬ 
ications  or  other  resources  needed  for  effective  management  of  the  network,  then  the 
Chaining  algorithm  is  preferred. 

S.  Conclusions 

These  studies  are  determining  which  distributed  control  algorithms  can  run 
efficiently  on  many  large  computing  systems.  The  emphasis  throughout  is  on  evaluations 
for  computing  systems  with  thousands  or  millions  of  interconnected  processing  nodes. 
Massive  parallel  systems  are  becoming  economically  feasible  and  justifiable.  An  under¬ 
standing  of  how  to  structure  parallel  algorithms  for  them  is  urgently  needed. 
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ABSTRACT 

In  this  paper  we  consider  the  problem  of  designing  and  selecting  load  balancing 
mechanisms  for  distributed  systems  based  on  local  area  networks.  In  particular,  we 
will  focus  on  the  type  of  information  needed  to  make  balancing  decisions  and  on  the 
desirable  properties  of  the  decision  algorithms.  We  also  describe  our  current  approach 
to  this  problem,  which  essentially  consists  of  carrying  out  a  series  of  experiments  on  a 
prototype  load  balancing  implementation.  Finally,  we  present  the  insights  we  have 
derived  from  our  experimental  results. 


1.  Introduction 

Many  of  the  currently  existing  computing  environments  consist  of  a  heterogeneous 
collection  of  of  workstations  and  mainframes  connected  by  a  high  bandwidth  local 
area  network  (LAN).  One  of  the  main  benefits  of  working  in  a  distributed  system 
created  for  such  an  environment  is  being  able  to  share  scarce  resources  with  other 
users  of  the  network;  but  one  resource  that  is  often  not  shared  is  the  processing  capa¬ 
city  of  the  network  nodes.  In  many  systems,  the  scheduling  of  user  jobs  is  individually 
carried  out  by  each  processor,  and  the  computations  of  the  users  logged  on  at  any  one 
machine  are  performed  locally.  The  decentralization  of  CPU  management,  coupled 
with  large  differences  in  the  numbers  (and  types)  of  users  connected  to  each  of  the 
nodes  in  the  network,  often  can  lead  to  situations  where  there  are  great  disparities  in 
load  among  the  machines  on  the  network.  For  example,  at  our  local  computer  center, 
as  the  due  date  for  a  class  assignment  approaches,  the  processor  assigned  to  the  stu¬ 
dents  in  that  class  becomes  heavily  loaded,  while  other  machines  are  underutilized. 
Although  users  can  determine  by  themselves  that  an  imbalance  exists,  and  remotely 
log  onto  another  computer,  we  feel  that,  in  order  to  prevent  a  chaotic  situation  and  to 
aid  naive  users,  it  is  best  to  develop  strategies  that  can  solve  the  load  balancing  prob¬ 
lem  in  an  automatic  way,  much  in  the  way  that  users  now  depend  on  virtual  memory 
techniques  to  manage  their  memory  space. 

We  are  currently  exploring  the  design  of  load  balancing  strategies  for  LAN-based 
distributed  systems.  In  this  paper,  we  describe  some  of  the  strategies  that  we  are  con¬ 
sidering  for  implementation,  as  well  as  the  criteria  we  are  using  to  determine  which 
schemes  are  appropriate  for  a  given  environment. 


In  the  next  section  we  provide  some  details  concerning  our  approach  to  the  prob¬ 
lem.  Sections  3  and  4  deal  with  load  metrics  and  decision  policies  respectively  (these 
terms  are  defined  in  Section  2).  The  last  section  of  the  paper  discusses  the  current 
state  of  our  work  and  presents  some  preliminary  conclusions. 

2.  Our  Approach 

Although  load  strategies  have  been  studied  in  the  past,  most  of  this  work  has 
been  carried  out  within  a  theoretical  framework  (see  [Chul980]  for  a  survey  of  some  of 
these  approaches);  moreover,  some  of  the  researchers  carrying  out  those  studies  have 
made  simplifying  assumptions  that  may  not  hold  in  practice.  Also,  there  have  been  a 
small  number  of  load  balancing  implementations  (for  example,  see[HwangI982]),  but, 
as  far  as  we  know,  the  strategies  employed  were  chosen  without  extensive  study.  Our 
work  differs  from  most  of  the  previous  research  in  that,  while  we  are  interested  in 
actually  implementing  a  load  balancing  mechanism,  we  will  not  chose  a  strategy  in  an 
ad  hoc  fashion,  but  rather,  we  plan  to  devote  extensive  study  to  the  choice  of  strategy 
to  be  programmed. 

Before  we  describe  our  approach  further,  a  few  remarks  are  needed.  A  load 
balancing  strategy  is  composed  of  two  parts,  a  load  information  aspect  (i.e.,  what 
information  will  be  used  to  determine  the  load  in  the  machines  of  the  network),  and  a 
decision  policy  (i.e.,  given  the  load  information,  how  will  it  be  used  to  decide  where  to 
run  a  job).  In  order  to  speak  meaningfully  of  a  machine  being  “more  loaded”  than 
another,  we  define  a  load  metric  as  a  real-valued  function  of  the  load  information.  For 
example,  the  load  information  could  be  composed  of  the  number  of  processes  in  the 
CPU  ready  queue  for  a  given  interval;  a  possible  load  metric  would  then  be  the  aver¬ 
age  number  of  ready  jobs  during  the  last  minute.  This  metric  is  essentially  the  “load 
average”  metric  provided  by  the  UNIX  4.2  BSD  [LefflerI984j  uptime  command.  In 
the  next  two  sections  we  consider  load  metrics  and  decision  policies  in  greater  depth. 

Our  current  study  of  the  load  balancing  problem  comprises  three  phases.  First,  to 
consider  what  are  suitable  load  metrics  for  our  environment.  Then,  to  list  a  number  of 
possible  decision  policies  that  make  use  of  the  load  metrics  being  studied.  Finally, 
implementing  various  combinations  of  policies  and  metrics,  in  order  to  study  their  per¬ 
formance  for  a  number  of  synthetic  workloads. 

We  have  already  developed  a  prototype  implementation  of  a  load  balancing 
mechanism,  which  can  be  used  with  different  load  strategies.  It  consists  essentially  of 
a  shell  that  can  make  scheduling  decisions  based  on  load  information  broadcast  by 
cooperating  daemons  (see  (Alonsol986a]  for  more  details).  Our  prototype  now  runs  in 
a  laboratory  consisting  of  a  variety  of  SUN  workstations,  connected  by  an  Ethernet 
[MetcalfeI976].  Although  experiments  are  still  being  carried  out.  we  will  briefly  com¬ 
ment  on  our  current  results  in  the  concluding  section  of  the  paper. 

3.  Load  Metrics 

The  selection  of  a  load  metric  requires  a  careful  definition  of  what  is  meant  by 
the  load  of  a  processor.  It  seems  clear  that  load  should  be  defined,  at  least  partially, 
in  terms  of  a  set  of  performance  indices  (such  as  CPU  utilization  or  mean  number  of 
I/O  requests),  hut  it  is  less'  clear  that  two  different  processes  should  use  the  same 
definition  of  load:  for  example,  an  I/O  intensive  job  will  probably  perceive  load  in  a 


different  way  than  a  CPU-bound  job.  Whether  the  benefits,  if  any,  of  using  different 
load  metrics  for  different  tasks  are  sizable  is  not  immediately  obvious;  at  any  rate,  at 
present  we  are  considering  only  global  load  metrics.  Our  rationale  for  this  decision  is 
twofold.  In  the  first  place,  in  our  computational  environment  we  typically  do  not 
know  the  characteristics  of  the  jobs  being  executed,  and  thus,  cannot  compute  job- 
specific  load  metrics.  Secondly,  we  feel  that  a  simpler  technique  has  a  better  chance 
of  working  well  in  an  implementation  than  a  more  complex  approach. 

A  compromise  between  global  and  job-specific  load  metrics  is  to  compute  a 
different  load  metrics  for  each  job  class.  The  simplest  such  scheme  would  involve 
categorizing  processes  as  either  CPU  or  I/O  bound.  Although  promising,  this 
approach  would  not  be  of  interest  in  our  system,  since  most  user  jobs  obtain  their  data 
(more  precisely,  all  their  pages)  from  a  network  file  server.  Thus,  there  is  little  one 
can  do  (in  terms  of  migrating  the  task)  for  I/O  bound  jobs,  except  to  exclude  them 
from  the  load  balancing  algorithm.  (Actually,  this  is  equivalent  to  a  load  metric  that 
has  the  same  value  for  all  machines  under  all  conditions.)  More  will  be  said  about  the 
types  of  jobs  that  should  not  be  migrated  in  the  next  section. 

Another  important  issue  is  that,  since  load  metric  information  cannot  be  con¬ 
stantly  broadcast  (because  it  would  be  too  costly),  the  rate  at  which  such  broadcasts 
are  made  needs  to  be  studied.  The  problem  that  arises  is  that  machines  are  making 
their  job  scheduling  decisions  based  on  possibly  stale  data.  This  could  lead  to 
unstable  behavior;  for  example,  an  idle  node  could  start  receiving  migrated  jobs  from 
many  other  machines,  and  by  the  time  it  broadcast  that  it  was  overloaded,  perhaps 
too  many  jobs  would  have  been  sent  to  it.  We  have  already  seen  such  behavior  in  our 
previous  study  of  the  possible  improvements  of  introducing  load  balancing  strategies  in 
distributed  database  query  optimizers  [Alonsol986b]. 

Actually,  the  cost  of  frequent  load  information  broadcasts  involves  two  factors: 
the  cost  of  computing  the  load  metric,  and  the  overhead  of  sending  and  receiving  the 
messages  over  the  network.  One  possible  compromise  would  be  to  gather  the  load 
information  more  frequently,  but  broadcast  it  at  less  frequent  intervals.  The  rationale 
for  this  is  that,  with  this  approach,  each  machine  has  a  better  idea  of  its  local  load, 
and  can  start  migrating  all  of  its  jobs  when  it  is  overloaded.  One  of  the  results 
described  in  [Alonsol986b]  suggests  that,  at  least  in  the  context  of  database  jobs,  this 
strategy  can  be  quite  successful.  We  plan  to  determine  if  that  result  holds  for  general 
purpose  jobs. 

A  possible  candidate  for  a  load  metric  is  the  load  average  described  above. 
Another  is  the  effective  load  average,  defined  as  the  load  average  divided  by  the  pro¬ 
cessor  MIPS  rate.  A  third  is  CPU  utilization.  The  number  of  ready  jobs  plus  the 
number  of  disks  requests  per  unit  of  time  seems  a  more  comprehensive  measure  of  load 
than  a  metric  that  involves  solely  the  demand  on  the  CPU.  Another  possibility  is  to 
compute  the  utilization  of  the  bottleneck  resource  at  every  machine  (since  the 
throughput  of  a  system  is  dominated  by  the  scarcest  resource,  the  utilization  of  this 
critical  resource  is  a  good  indication  of  the  node  load).  Lastly,  a  binary  metric  (“idle" 
or  “not_jd!e”)  would  be  an  appropriate  choice  if  we  were  interested  in  migrating  jobs 
primarily  to  unused  machines  on  the  network.  Currently,  we  are  planning  to  study 
most  of  the  metrics  just  described. 
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The  are  a  number  of  issues  involved  in  evaluating  a  load  metric: 

[1]  The  stability  of  a  metric  is  important,  because  if  it  responds  too  quickly  to  minor 
changes  in  system  parameters  it  may  lead  to  unstability  in  our  load  balancing 
strategies.  The  measurement  technique  used  to  compute  the  metric  is  relevant 
here  (for  example,  exponentially  smoothing  a  sample  metric  can  improve  the  sta¬ 
bility  of  the  metric). 

[2]  We  are  also  interested  in  the  generalizability  of  the  metric,  since  we  desire  our 
schemes  to  be  useful  for  non-UNIX  systems  as  well  as  being  valid  for  a  hetero¬ 
geneous  set  of  hardware  configurations. 

[3]  Clearly,  the  implementability  of  the  metric  is  of  primary  concern.  Since  we 
intend  to  carry  out  experiments  using  the  various  metrics,  the  ones  that  are  either 
too  difficult  or  too  expensive  to  implement  or  to  compute  must  be  discarded. 

[4]  Finally,  there  must  be  empirical  evidence  that  the  metric  to  be  implemented 
actually  reflects  our  intuitive  notion  of  “load”.  This  can  be  determined  by  experi¬ 
mental  study. 

Before  we  leave  this  topic  it  should  be  pointed  out  that,  clearly,  a  load  metric  is 
only  as  good  as  the  operating  system  statistics  on  which  it  is  based.  In  the  case  of 
UNIX  4.2  BSD,  it  is  not  clear  that  the  statistics  provided  by  the  kernel  can  be  trusted 
completely.  Thus,  we  also  plan  to  carry  out  a  study  of  the  statistics  gathering 
software  of  BSD  UNIX. 


4.  Decision  Policies 

Before  we  describe  some  of  the  decision  policies  that  seem  attractive,  we  will 
touch  upon  some  of  the  salient  features  of  such  policies.  Policies  can  be  categorized  as 
more  static  or  more  dynamic;  for  example,  always  using  the  same  computational  server 
to  offload  a  machine  would  be  a  very  static  policy,  while  choosing  the  server  at  run¬ 
time  is  more  dynamic.  Some  policies  require  knowing  varying  amounts  of  information 
about  a  process  in  order  to  schedule  it  (e.g.,  a  strategy  may  require  that  we  know  if  a 
process  is  CPU  or  I/O  bound).  The  scheduling  decision  may  be  made  at  a  central  site 
or  in  a  distributed  fashion  (we  are  only  interested  in  the  latter);  also,  the  location  of 
the  decision  mechanism  may  impact  the  number  of  messages  that  have  to  be  sent  to 
communicate  load  information.  A  key  feature  of  any  policy  is  whether  it  allows 
preemption  or  not  (i.e.,  whether  we  are  allowed  to  keep  migrating  jobs  after  they  start 
executing);  we  will  not  consider  such  policies  since  we  cannot  currently  implement 
them  in  a  UNIX  environment.  Some  policies  may  limit  the  processors  that  may  be 
chosen;  a  possible  strategy  may  be  to  let  only  idle  nodes  compete  for  tasks,  or  to  con¬ 
sider  only  machines  with  a  certain  MIPS  rate  for  task  allocation.  The  choice  of 
sender-initiated  versus  receiver-initiated  balancing  must  be  settled  (i.e.,  whether  to 
allow  busy  nodes  to  look  for  an  idle  node,  or  to  have  idle  nodes  advertise  their  ability 
to  work).  Finally,  it  is  clear  that  some  jobs  should  never  be  migrated;  this  may  be  so 
for  a  variety  of  reasons:  perhaps  the  jobs  should  only  be  run  locally  for  security  rea¬ 
sons,  or  maybe  the  computational  demands  of  the  task  are  so  slight  that  the  overhead 
of  moving  it  overshadows  any  possible  performance  gain.  There  are  still  many  other 
issues  to  be  considered:  do  we  optimize  for  the  current  task  or  do  we  look  at  sets  of 
task-'.’  do  we  examine  only  jobs  that  have  arrived  at  our  local  decision  maker 
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mechanism  or  at  all  the  incoming  jobs  in  the  network?  do  we  remember  previous  task 
assignments  when  we  make  a  decision?  which  performance  index  do  we  focus  on? 

There  are  many  policies  that  could  be  of  practical  use.  Perhaps  the  most  natural 
policy  is  to  send  jobs  to  the  “least  loaded”  machine.  A  variation  that  may  prove  more 
stable  consists  of  migrating  jobs  to  any  machine  whose  load  is  less  than  the  local  load 
by  a  specified  amount.  In  order  to  decrease  the  chance  that  all  busy  machines  select 
the  same  host  to  which  they  will  migrate  their  tasks,  each  machine  could  choose  the 
destination  randomly  from  the  n  least  loaded  processors  (for  some  n).  Another  alter¬ 
native  is  to  only  move  jobs  to  idle  processors.  Finally,  always  sending  the  jobs  of  over¬ 
loaded  processors  to  a  powerful  computational  server  may  be  reasonable  for  some 
environments. 

In  judging  a  decision  policy  we  may  consider  a  number  of  factors: 

[1]  As  we  mentioned  above,  the  stability  of  the  policy  under  imperfect  information  is 
important,  since  we  expect  that  imperfect  information  will  be  the  rule  rather  than 
the  exception. 

[2]  We  would  prefer  that  the  cost  of  the  load  balancing  scheme  be  negligible  com¬ 
pared  to  its  benefits  (note  that  there  are  two  types  of  costs,  the  costs  to  users  of 
using  the  system,  and  the  cost  to  non-users  of  the  added  overhead  of  running  the 
decision  mechanism). 

[3]  Also,  the  load  balancing  scheme  must  not  force  all  the  processors  in  the  network 
to  guarantee  a  given  level  of  service;  in  particular,  some  processors  may  refuse 
service  or  allow  only  certain  classes  of  tasks  to  be  assigned  to  them.  Thus,  the 
amount  of  autonomy  that  machines  have  under  a  given  policy  is  important. 
(Note  that  the  receiver-initiated  policies  mentioned  above  seem  to  simplify  achiev¬ 
ing  this  goal.) 

[4]  Finally,  the  amount  of  transparency  in  the  load  balancing  scheme  is  also  of  crit¬ 
ical  importance,  since  it  would  be  desirable  to  maintain  users  unaware  of  the  fact 
that  their  jobs  may  be  running  remotely. 

5.  Conclusions 

In  this  paper  we  have  presented  our  ideas  concerning  the  implementation  of  load 
balancing  mechanisms.  We  have  pointed  out  possible  load  metrics  as  well  as  sample 
decision  policies,  and  have  detailed  some  of  the  relevant  issues  involved  in  making  a 
choice  from  the  many  possible  strategies.  It  seems  clear  to  us  that  the  selection  of  a 
load  balancing  strategy  involves  many  complex  choices  which  require  careful  study, 
and  it  is  certainly  not  clear  a  priori  which  strategy  to  pursue. 

At  the  present  time,  we  are  actively  involved  in  a  series  of  experiments  that 
explore  some  of  the  ideas  presented  in  this  paper.  Our  previous  work  ([AlonsolOSGa] 
and  [Alonsol986b])  suggests  that  the  gains  due  to  load  balancing  can  be  quite  sizable, 
and  our  current  results  continue  to  be  promising.  Moreover,  we  feel  that  this  area  of 
research  contains  many  problems  of  both  practical  and  research  interest  and  merit 
further  study. 
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Abstract 

In  this  project  we  study  the  parallel  execution  of  programs  on  a  collection  of 
workstations  connected  by  a  local  area  network.  So  far,  our  work  has  focused  on 
developing  a  number  of  pilot  applications,  including  a  distributed  Pascal  compiler 
and  a  distributed  make  facility.  We  intend  to  use  these  applications  as  an  experi¬ 
mental  testbed  for  studying  issues  in  problem  partitioning,  fault  tolerance,  machine 
allocation,  and  monitoring  and  debugging  of  distributed  programs. 

1  Introduction 

In  recent  years  there  has  been  an  evolution  away  from  centralized  timesharing  systems  towards 
systems  consisting  of  workstations  connected  by  a  high-speed  local  area  network  to  specialized 
server  machines.  The  main  advantage  of  workstation-based  systems  is  that  each  user  has  the 
full  power  of  a  dedicated  machine  at  his  disposal.  Therefore,  performance  depends  less  on  the 
overall  number  of  users  of  the  system. 

Unfortunately,  this  advantage  does  not  come  for  free.  In  particular,  most  workstation- 
based  systems  do  not  allow  convenient  access  to  the  computing  power  of  other  machines  in  the 
system,  even  if  those  machines  are  idle.  Centralized  systems,  in  contrast,  can  typically  make 
all  of  their  idle  resources  available  to  a  single  user.  This  weakness  of  most  workstation-based 
systems  is  particularly  serious  when  one  realizes  that  demand  for  computing  power  is  highly 
bursty  making  strict  preallocation  of  resources  highly  inefficient.  For  instance,  when  compiling 
a  large  collection  of  programs  with  the  UNIX  make  facility,  one  could  easily  use  the  power  of 
many  workstations  if  the  compilations  can  be  run  in  parallel.  Nevertheless,  one  is  essentially 
restricted  to  the  computing  power  of  one’s  own  workstation,  regardless  of  the  fact  that  a  large 
amount  of  computing  power  on  other  machines  may  be  available. 

This  argument  is  not  to  be  interpreted  as  one  against  workstation-based  systems.  Due 
to  the  economics  of  VLSI  and  local  network  technology,  it  is  currently  more  cost  effective  to 
build  ten  2-MIPS  machines  and  connect  them  by  a  suitable  high-speed  network  than  to  build 
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a  single  20-MIPS  machine.  Also,  sharing  a  single  machine  entails  a  substantial  amount  of 
overhead  for  providing  protection  and  fair  sharing  among  users:  a  20-MIPS  machine  shared  by 
ten  users  often  appears  less  powerful  than  ten  dedicated  2-MIPS  machines.  Nevertheless,  we 
see  potential  in  using  a  collection  of  workstations  in  a  more  multiprocessor-oriented  manner. 
Such  an  arrangement  is  especially  attractive  if  the  workstations  are  diskless,  thereby  avoiding 
the  need  to  worry  about  the  owner’s  local  private  storage. 

We  are  using  the  V>System  developed  at  Stanford  University  as  the  base  tool  for  our  re¬ 
search  [5, 4, 2].  The  V-System  is  a  relatively  complete  distributed  operating  system  for  a  collec¬ 
tion  of  SUN  workstations  connected  by  a  10- megabit  Ethernet.  It  consists  of  a  kernel  resident 
on  each  machine  in  the  system  and  a  collection  of  server  processes.  The  kernels  cooperate 
to  provide  transparent  (location  independent)  communication  between  processes.  The  server 
processes  provide  the  other  services  typically  thought  of  as  part  of  the  operating  system,  such 
as  file  access  and  network  access.  Additionally,  a  large  number  of  application  programs  have 
been  written  for  the  system,  including  a  window  system,  text  editors,  a  graphics  editor,  and 
compilers  for  C  and  Pascal.  Much  emphasis  in  the  design  of  the  V-System  has  been  placed  on 
high-speed  interprocess  communication.  We  believe  highly  efficient  (network)  interprocess  com¬ 
munication  is  essential  if  we  are  to  investigate  the  limits  of  the  granularity  of  the  decomposition 
that  we  can  afford  in  a  network  environment. 

The  project  described  here  was  started  in  January  1986  with  funding  from  the  National 
Science  Foundation  [1].  So  far,  we  have  concentrated  on  developing  a  small  set  of  prototype 
parallel  programs.  In  particular,  we  have  developed  several  approaches  to  parallel  compilar 
tion  [3,  7]  as  well  as  a  parallel  implementation  of  the  make  program.  These  applications  are 
described  in  Section  2.  While  we  continue  to  refine  these  applicationr  we  also  intend  to  use 
them  as  a  testbed  for  investigating  issues  such  as  fault  tolerance,  machine  allocation,  and  de¬ 
bugging  and  monitoring  of  distributed  programs.  Our  initial  approaches  to  these  problems  are 
described  in  Section  3. 

2  Pilot  Applications 

2.1  An  Experimental  Parallel  Compiler 

A  compilation  task  is  composed  of  a  number  of  phases  including  scanning,  parsing,  semantic 
analysis,  code  generation,  optimization  and  possibly  assembly.  Our  approach  to  speeding  up 
compilation  is  based  on  the  observation  that  scanning  and  parsing  should  take  relatively  little 
time  compared  to  the  other  phases  of  the  compilation  process.  As  a  result,  scanning  and  parsing 
are  done  sequentially,  while  the  other  phases  are  executed  in  parallel.  Given  this  observation,  we 
need  to  find  a  framework  for  expressing  semantic  analysis  and  code  generation  that  is  amenable 
to  parallel  execution.  We  use  attribute  grammars  for  this  purpose.  In  an  attribute  grammar 
specification,  each  nonterminal  is  annotated  by  a  set  of  attributes.  Semantic  functions  specify 
the  values  of  the  attributes  of  nonterminals  in  a  given  production  in  terms  of  the  values  of  other 
attributes  of  nonterminals  in  the  same  production.  From  a  language  viewpoint,  the  strength  of 
attribute  grammars  result  from  the  fact  that  they  provide  a  high-level,  non-procedural  definition 
of  the  semantics  of  a  programming  language.  Efficient  implementation  of  an  attribute  grammar 
evaluator  on  a  sequential  machine  is  non-trivial,  though.  When  considering  a  parallel  machine, 
however,  the  applicative  nature  of  an  attribute  grammar  specification  lends  itself  particularly 
well  to  efficient  parallel  implementation.  This  applicative  nature  leaves  the  order  in  which 
attributes  are  evaluated  relatively  unconstrained,  and  as  a  result,  synchronization  overhead  is 
kept  to  a  minimum. 


We  have  built  a  prototype  implementation  of  this  system  for  a  sizable  Pascal  subset.  A 
sequential  parser  builds  the  syntax  tree  and  sends  out  subtrees  for  evaluation  to  a  set  of  par¬ 
allel  evaluator  processes.  Each  of  those  processes  evaluates  the  attributes  of  its  subtree  and 
communicates  the  attribute  values  with  other  evaluator  processes.  Finally,  the  root  evaluator 
sends  back  the  “code”  attribute  to  the  parser.  Preliminary  experiments  with  this  prototype 
show  a  speedup  factor  of  3  to  4  for  8  processors  (the  maximum  currently  available  for  our 
experiments),  with  sequential  execution  speeds  comparable  to  conventional  compilers  [3]. 

Much  work  remains  to  be  done  on  this  specific  problem.  First,  we  need  to  get  a  better 
idea  of  the  behavior  of  the  parallel  compiler  in  terms  of  computation,  communication  and 
synchronization.  Second,  the  algorithm  used  in  the  evaluator  is  a  straightforward  dynamic 
algorithm.  State-of-the-art  sequential  evaluators  use  a  static  evaluation  algorithm  whereby 
dependency  analysis  is  done  at  generation  time  and  not  at  execution  time.  We  wish  to  explore 
the  combination  of  static  and  parallel  evaluation.  Finally,  a  full  implementation  of  a  language 
relevant  to  our  research  is  desirable  so  that  we  can  use  the  parallel  compiler  in  our  daily  work. 

2.2  A  Parallel  Make  Facility 

The  standard  UNIX  make  facility  automates  the  compilation  of  source  code  files  (and  other  types 
of  processing)  by  allowing  the  programmer  to  specify  the  dependencies  among  the  various  files 
that  make  up  a  program  and  record  instructions  about  now  the  results  are  to  be  built.  In  the 
traditional,  sequential  version  of  make,  the  system  builds  one  target  at  a  time.  As  a  result, 
complex  targets  that  require  the  execution  of  many  commands  require  a  large  amount  of  time. 
Since  a  large  number  of  these  commands  tend  to  be  independent  (for  instance,  compilations  of 
unrelated  modules),  many  of  these  commands  can  be  executed  in  parallel.  Parallel  execution 
is  accomplished  by  sorting  the  task  dependency  graph  in  topological  order  and  starting  up  as 
many  commands  in  parallel  as  possible  within  the  constraints  of  the  dependency  graph.  The 
main  “controller”  process  awaits  completion  of  the  subtasks,  updates  the  dependency  graph 
accordingly,  and  starts  up  new  commands  that  have  become  ready. 

The  speedup  resulting  from  parallel  execution  is  very  dependent  on  the  actual  dependency 
graph.  A  single  large  compilation  combined  with  a  number  of  small  ones  tends  to  produce 
only  moderate  speedups,  even  with  large  numbers  of  machines  available.  On  the  other  hand, 
if  a  large  number  of  compilations  of  approximately  equal  length  are  required,  nearly  linear 
speedup  can  be  achieved  with  a  large  number  of  processors.  Early  experiments  have  also  shown 
very  significant  loads  on  the  file  server  as  a  result  of  a  parallel  make  (all  of  our  workstations 
are  diskless).  This  suggests  the  need  for  improved  caching  strategies  on  the  workstations  and, 
perhaps,  the  need  for  parallel  program  loading  if  a  broadcast  network  is  used. 

3  Research  Issues 

3.1  Fault  Tolerance 

In  recent  years,  a  number  of  fault-tolerance  mechanisms  for  distributed  systems  have  been 
developed.  However,  most  of  these  mechanisms  either  restrict  the  types  of  applications  that 
can  !>.  run  (for  example,  requiring  them  to  be  structured  as  a  series  of  atomic  transactions) 
or  require  specialized  hardware  to  support  the  communication  between  machines  needed  for 
the  fault  tolerance.  Also,  these  mechanisms  tend  to  place  a  large  amount  of  overhead  on  the 
system  in  both  increased  computation  and  communication  costs.  What  we  would  like  is  a  simple, 
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low-overhead  technique  to  achieve  fault  tolerance  that  can  be  used  with  existing  distributed 
applications  programs  and  standard  workstation  and  local  network  hardware. 

A  popular  method  of  achieving  fault  tolerance  is  to  periodically  checkpoint  each  process 
(individually)  and  to  log  all  message  traffic  which  a  process  receives  either  on  disk  or  with  some 
special  logging  process  on  a  separate  machine.  Then,  if  a  process  fails,  it  can  be  restarted  from 
its  most  recent  checkpoint,  and  the  messages  which  have  been  logged  since  that  checkpoint  can 
be  replayed  to  it,  thus  restoring  its  internal  state  to  the  point  at  which  the  fault  occurred.  The 
frequency  of  checkpointing  can  be  tuned  to  balance  the  expense  of  the  checkpoint  operations 
against  the  time  needed  for  recovery,  but  the  logging  must  be  done  for  each  message  that  each 
process  receives,  placing  a  continuous  overhead  on  the  operation  of  the  system  even  in  the  case 
where  no  faults  ever  occur. 

The  solution  we  are  exploring  is  based  on  an  analysis  of  the  minimum-cost  method  of 
accomplishing  this  logging  [6].  Since  the  sender  and  receiver  processes  get  a  copy  of  each 
message  exchanged  between  them,  it  would  be  convenient  if  one  of  them  could  serve  as  the 
log  by  saving  a  copy  of  the  message  in  its  own  local  memory.  Unfortunately,  the  receiver  can 
not  log  the  message  in  this  way  since  the  purpose  of  the  logging  is  to  be  able  to  recover  if  the 
receiver  fails.  The  sender,  though,  could  log  the  message,  except  that  it  is  also  necessary  to 
log  the  order  in  which  messages  were  received  which  the  sender  does  not  normally  know.  We 
believe  that  by  having  the  receiver  reply  back  to  the  sender  with  a  message  ordinal  number,  we 
can  have  the  necessary  recovery  information  available  at  very  little  cost  in  performance. 

3.2  Machine  Allocation 

Allocation  of  computational  tasks  to  processors  in  a  multi-machine  environment  has  received 
a  great  deal  of  attention,  especially  in  theoretical  work.  However,  relatively  little  experience 
exists  with  real  implementations.  We  hope  to  provide  some  experimental  evidence  on  the 
relative  merits  of  different  techniques  that  have  been  studied  in  a  theoretical  framework. 

Given  our  environment,  the  problems  of  machine  allocation  are  compounded  by  a  number 
of  extra  complications.  First,  the  machines  ace  not  dedicated  to  compute-intensive  work.  In 
first  order,  these  machines  should  provide  highly  responsive  interaction  to  their  owners.  These 
“guest”  computations  should  not  be  allowed  to  degrade  interactive  response,  but  at  the  same 
time  the  execution  time  of  the  compute-bound  jobs  should  be  minimized.  Second,  due  to  the 
relative  abundance  of  CPU  power  when  all  machines  are  combined  and  the  fact  that  compute- 
bound  tasks  are  generated  rather  sporadically,  many  of  the  common  assumptions  made  in 
previous  studies  do  not  seem  to  hold.  For  instance,  CPU  run  queue  length  might  not  necessarily 
be  a  very  valuable  indicator  of  load  on  a  workstation,  since  it  usually  will  be  either  0  or  1.  There 
also  seems  to  be  little  correlation  between  present  and  future  load,  unlike  on  timesharing  systems 
where  the  large  number  of  users  smooths  out  load  fluctuations. 

3.3  Monitoring  and  Debugging 

Most  of  the  measurements  on  our  current  distributed  programs  are  done  by  ad  hoc  techniques. 
Counters  are  inserted  in  selected  places  in  the  program,  and  network-wide  monitors  are  occa¬ 
sionally  useful  in  getting  system-wide  measurements.  More  sophisticated  tools  for  debugging 
and  performance  monitoring  are  essential,  however.  We  are  interested  in  getting  two  kinds  of 
measurements.  First,  we  would  like  to  get  a  better  characterization  of  local  network  traffic  in  an 
environment  of  diskless  workstations,  especially  when  those  workstations  are  heavily  used  for 
parallel  program  execution.  Older  studies  done  in  a  less  integrated  environment  indicate  that 
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the  load  on  a  local  network  tends  to  be  very  low.  In  an  environment  of  diskless  workstations 
where  all  secondary  storage  is  accessed  over  the  network  and  where  programs  execute  on  dif¬ 
ferent  machines,  this  evaluation  needs  to  be  reassessed.  Second,  we  are  interested  in  exploring 
whether  there  is  a  way  to  combine  communication  traces  with  more  traditional  techniques  for 
debugging  distributed  programs. 
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Preface 

At  the  request  of  the  National  Aeronautics  and  Space  Administration,  the 
National  Research  Council’s  Aeronautics  and  Space  Engineering  Board  conducted 
a  workshop  in  January  1984  to  project  what  the  state  of  knowledge  in  aeronauti¬ 
cal  technology  could  be  in  the  year  2000,  if  necessary  supporting  resources  were 
made  available.  Some  80  experts  participated  in  the  study.  They  were  organized 
into  eight  panels  in  the  areas  of  aerodynamics;  propulsion;  structures;  materials; 
guidance,  navigation,  and  control;  computer  and  information  technology;  human 
factors;  and  systems  integration. 

This  author  chaired  the  panel  on  computer  and  information  technology. 
This  paper  provides  a  brief  summary  of  the  major  findings  of  this  panel  together 
with  some  more  recent  observations.  The  full  report  of  the  panel  can  be  found  in 
the  report  Aeronautics  Technology  Possibilities  for  2000:  Report  of  a 
Workshop  (available  from  the  Aeronautics  and  Space  Engineering  Board,  Com¬ 
mission  on  Engineering  and  Technical  Systems,  National  Research  Council,  2101 
Constitution  Avenue,  N.W.  Washington,  D.C.  20418).  A  copy  of  the  report  of 
the  computer  and  information  technology  panel  is  included  with  this  paper  as  an 
Appendix.  A  paper  based  on  this  report,  Computer  and  Information  Tech¬ 
nology  in  the  Year  2000  -  A  Projection,  by  Lundstrom  and  Larsen  is  avail¬ 
able  in  IEEE  Computer  Magazine,  September  1985,  Vol.  18,  No.  9,  pp.  68-79. 
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Computers  are  playing  significantly  larger  roles  in  many  application  domains  today.  This 
growth  is  expected  to  continue  since  information  processing  technology  has  advanced  by  a  factor 
of  10  every  5  years  for  the  past  35  yean  and  is  expected  to  continue  to  do  so.  Breakthroughs  in 
device  technology,  from  vacuum  tubes  through  transistors  to  integrated  circuits,  contribute  to  this 
rapid  pace.  This  progress  is  nearly  matched  by  similar,  though  not  as  dramatic,  advances  in 
numerical  software  and  algorithms.  Progress  has  not  been  easy.  Many  technical  and  non¬ 
technical  challenges  were  surmounted.  The  outlook  is  for  continued  growth  in  capability,  but  will 
require  surmounting  new  challenges. 

The  technology  forecast  presented  in  this  paper  was  developed  by  extrapolating  current 
trends  and  assessing  the  possibilities  of  several  high-risk  research  topics.  In  the  process,  critical 
problem  areas  that  require  resarch  and  development  emphasis  have  been  identified.  The  outlook 
assumes  a  positive  perspective;  the  projected  capabilities  are  possible  by  the  year  2000,  and  ade¬ 
quate  resources  will  be  made  available  to  achieve  them. 

The  following  summary  of  the  outlook  to  the  year  2000  is  taken  from  the  Computer  Magazine 
article  cited  above. 

The  capabilities,  on  the  hardware  side,  will  be  based  on  components  that  are  100  to  1000 
f  times  more  cost-effective  and  size-effective  than  those  available  today,  as  well  as  on  com¬ 

ponents  with  major  new  capabilities  such  as  arrays  of  solid-state  sensors.  Airborne  systems 
will  be  synthesized  from  heterogeneous  collections  of  processors,  some  of  which  will  execute 
at  rates  up  to  1,000  MIPS  and  will  contain  50  Mword  ->f  random  access  memory.  Product 
designers  will  have  the  use  of  workstations  (with  local  processing  rates  of  100  to  1000  MIPS 
on  32-bit  operands,  with  50  Mword  RAM,  and  rotating  storage  of  10  to  100  Gwords)  and 
supercomputers  (composed  of  100  to  100,000  processing  elements,  with  cumulative  process¬ 
ing  power  of  up  to  1000  Gflops  and  containing  up  to  10  Gwords  of  RAM).  These  various 
systems  will  take  advantage  of  software  that  will  support  distributed,  heterogeneous  data¬ 
bases,  real-time  flight  management  and  crew  assistance,  as  well  as  multidisciplinary  design 
and  manufacturing  support  tools. 

Computer  and  information  technology  is  supporting  and  enabling  technology  to  the 
aeronautics  industry  as  well  as  many  others  of  course,  both  directly  and  indirectly.  The 
technologies  considered  by  the  panel  were  those  related  to  application  within  aeronautics 
products  directly,  to  research  leading  to  the  development  of  technologies  needed  for  future 
aeronautics  products,  to  the  design,  development,  and  manufacturing  of  the  aeronautics  pro¬ 
ducts,  to  the  support  of  these  products  including  maintenance,  diagnostics,  and  provisioning, 
and  to  the  use  of  the  products,  including  support  of  the  impact  on  the  flight,  crew.  The 
benefits  of  computing  are  realized  (indirectly,  usually)  through  disciplines  such  as  aero¬ 
dynamics,  and  guidance,  navigation  and  control.  Therefore,  the  impact  of  computing  on 
aerospace  technology  or  technology  in  other  areas  will  be  in  direct  proportion  to  the  extent 
to  which  these  supporting  disciplines  exploit  computing. 


Assimilation  of  the  technology  into  engineering  practice  will  continue  to  be  a  major  chal¬ 
lenge.  The  critical  technical  impediments  will  be  trustworthiness,  size/performance,  and 
management  of  complexity.  The  major  challenges  are  likely  to  come  from  the  areas  of 
software  development,  system  verification  and  validation,  communication  and  control,  and 
programs  used  to  augment  human  intelligence,  such  as  expert  systems,  theorem  proving,  and 
symbolic  mathematics. 

Possible  Impact  of  Advancing  Technology 

The  following  paragraphs  are  based  on  information  obtained  since  the  report  summarized 
above  was  prepared. 

Engineering/Manufacturing  Infrastructure 

A  major  problem  in  many  industries  is  the  shortening  product  life  cycle  combined  with 
future  products  which  are  more  complex  and  seem  to  require  more  time  and  resources  for  their 
development.  A  dramatic  reduction  in  development  costs  is  critically  needed.  Major  reductions 
in  development  costs  would  include  significantly  reduced  development  time,  allowing  first  delivery 
of  a  product  to  be  much  earlier  in  the  product  life  cycle.  Future  computer  and  information  tech¬ 
nology  is  expected  to  play  a  major  role  in  accomplishing  such  a  reduction.  Some  of  the  specific 
areas  where  major  contributions  can  be  expected  are: 

1.  Rapid  Turn-Around  Design, 

2.  Flexible  Manufacturing, 

3.  Integrated  Vendors,  and 

4.  Micro  Manufacturing. 

All  of  the  above  factors  relate  to  reducing  product  cost,  either  by  reducing  product  develop¬ 
ment  costs,  or  by  reducing  the  cost  of  manufacturing.  In  addition,  all  of  these  factors  will  be 
enabled  by  advancing  computer  and  information  technology.  Consider  each  of  these  briefly: 

Rapid  Turn-Around  Design  Computer  and  Information  Technology  will  significantly  change 
the  way  in  which  complex  designs  are  approached.  New  means  for  groups  of  people  to  efficiently 
interact  during  the  design  process  will  be  developed.  The  computer-based  tools  will  significantly 
enhance  the  individual’s  capacity  to  manage  complexity.  As  a  result,  product  design  will  require 
significantly  less  manpower  for  a  given  level  of  complexity,  leading  to  reductions  in  the  product 
design  cost  and  to  reductions  in  the  time  needed  to  complete  a  design.  Computer  tools  will 
enable  the  implementation  of  special  purpose  hardware  as  easily  as  software  can  be  developed. 
These  advances  should  reduce  the  time  to  introduction  of  a  product,  thus  reducing  the  cost  com¬ 
mittments  to  the  development  of  the  product  before  cash  return  can  be  realized. 

Flexible  Manufacturing  The  advent  of  general  purpose  manufacturing  handlers  and  tooling 
will  allow  the  same  manufacturing  facilities  to  be  utilized  in  the  production  of  many  different  pro¬ 
ducts,  simultaneously.  An  example  of  this  is  how  compiler  technology  is  being  used  to  develop 
such  a  capability  in  the  laboratories  at  the  Center  for  Integrated  Systems  at  Stanford  University. 
Dr.  Brian  Reid  has  been  directing  the  development  of  a  new  language,  called  FABLE.  FABLE  is 
an  integrated  circuit  manufacturing  process  specification  language.  When  a  designer  has  com¬ 
pleted  the  design  of  an  integrated  circuit,  both  the  mask  set  (or  mask  set  definition),  and  the 
specification  of  the  manufacturing  process  are  released  to  the  semiconductor  ’’foundry”.  The 
mask  set  controls  the  patterning  on  the  integrated  circuit  device  being  produced.  The  FABLE 
process  specification  is  compiled  into  modules  which  control  the  various  production  facilities  in 
order  to  exactly  duplicate  the  desired  manufacturing  conditions.  The  concept  of  silicon  foundries, 
where  one  vendor  provides  a  manufacturing  service  to  a  designer,  can  be  expected  to  be  utilized 
well  in  other  manufacturing  areas  as  well. 

Integrated  Vendors  Both  the  design  and  manufacturing  process  of  large,  complex  systems 
often  require  the  use  of  vendors  (both  during  design  and  manufacturing).  If  the  computer  tools  (in 


design,  manufacturing,  and  accounting)  of  both  the  manufacturer  and  the  vendors  are  appropri¬ 
ately  interfaced,  the  information  flow  between  companies  (now  handled  with  slow  manual  pro¬ 
cedures)  could  be  streamlined  with  a  resulting  reduction  in  the  overall  time  for  product  develop¬ 
ment.  If  the  manufacturer  and  vendor  computer  systems  are  integrated  appropriately, 
significantly  fewer  errors  would  be  expected  since  all  parties  involved  would  have  exactly  the  same 
information,  with  much  less  chance  for  human  error  in  the  dissemination  and  coordination  of  the 
information. 

Micro  Manufacturing  Once  manufacturing  facilities  are  automated  and  generalized 
sufficiently  that  a  broad  range  of  manufacturing  duties  can  be  accomplished  on  the  same  equip¬ 
ment,  small  flexible  factories  can  be  expected  to  be  created  near  the  major  centers  of  demand. 
These  small,  flexible  factories  would  receive  specifications  of  what  to  manufacture  through  an 
electronic  ordering  process.  Finished  goods  would  already  be  near  the  customer,  with  a  significant 
reduction  in  shipping  costs. 

Impact  of  Non- Technical  Decisions 

One  of  the  challenges  of  any  long-range  forecast  is  to  second  guess  what  the  actual  motivat¬ 
ing  conditions  will  be  to  actually  drive  the  development  of  technology.  These  conditions  were 
studiously  avoided  in  the  report  summarized  earlier.  Following  is  a  brief  discussion  of  three  possi¬ 
ble  motivating  conditions  and  their  impact  on  the  advance  of  computer  and  information  technol¬ 
ogy 

1.  Oil  Crisis 

2  Accounting/Office  Management 

3  Factory  Automation 

A  sustained  oil  crisis  would  probably  result  in  neighborhood  centers  containing  offices  of 
nearby  residents,  micro  factories,  child  care  centers,  post  office,  neighborhood  shops,  etc.  Locally 
centered  life  would  not  only  reduce  the  consumption  of  oil,  the  requirements  for  interconnected 
offices  and  businesses  would  grow  substantially.  Community  center  workplaces  and  micro  fac¬ 
tories  require  significant  advances  in  distributed  computing  as  well  as  security  so  that  the  many 
companies  represented  in  such  a  heterogeneous  workplace  could  feel  confident  that  their  portion 
of  the  workplace  is  free  from  spying  eyes. 

As  the  cost  of  computing  plummets  even  faster,  the  traditional  tradeoffs  between  the  cost  of 
employees  and  the  tools  which  they  use  changes  rapidly.  The  productivity  of  people  in  the  work¬ 
place,  even  in  many  office  management  and  administrative  roles,  will  become  ever  more  impor¬ 
tant.  Computer  and  information  technology  will  be  used  to  significantly  improve  the  productivity 
of  these  workers. 

The  factory  automation  necessary  to  implement  the  flexible  factories  and  the  micro  factories 
will  likely  cause  displacement  of  workers.  This  rapid  change  can  be  expected  to  encourage  the 
development  of  a  new,  computer-based  education  industry  to  serve  the  needs  to  rapidly  retrain 

displaced  workers. 

The  Future  -  2000  and  Beyond 

A  new  computer  organization,  inductive  machines,  may  be  available  in  the  future.  This 
term,  coined  by  Lipovski  and  Malek  at  the  University  of  Texas,  refers  to  a  system  which  can 
begin  with  a  functionally  complete  system  containing  at  least  one  of  each  subsystem.  The  system 
can  then  be  expanded  by  adding  a  copy  of  the  initial  system  with  the  resulting  system  acting  the 
same  as  the  basis  system,  but  faster.  As  more  and  more  speed  is  needed,  additional  copies  of  the 
initial  basis  system  are  added.  At  present,  such  an  inductive  organization  seems  feasible, 
although  the  system’s  speed  may  not  increase  as  fast  as  the  costs  when  the  size  is  increased 
However,  the  existence  of  inductive  applications  is  not  yet  clear 


In  today’s  terminology,  we  say  that  a  computer  "solves”  our  problem.  However,  to  be  more 
precise,  we  start  with  a  problem,  we  determine  how  to  solve  the  problem,  we  prepare  a  set  of  pre¬ 
cise  instructions  to  a  computer  which,  when  executed,  will  implement  the  solution  which  we 
invented.  With  the  advent  of  complex  expert  systems  in  the  future,  users  should  be  able  to 
approach  a  computer  with  a  description  of  their  problem  and  expect  the  computer  to  determine 
how  to  solve  it.  We  might  say  that  such  a  computer  is  "problemmed”  rather  than  "pro¬ 
grammed”  . 

When  trying  to  determine  what  markets  will  motivate  developments  in  the  future,  one  can 
identify  many  new,  potential  uses  of  computers  and  systems  which  they  will  be  a  part  of.  We 
should  not  forget,  however,  that  businesses  will  still  be  doing  accounting,  with  this  possibly  being 
the  most  prevelent  application  of  all  since  each  business,  and  possibly  each  worker  would  need  to 
be  using  such  services. 

A  last  mention  about  an  area  often  overlooked  in  discussions  of  future  directions  in  comput¬ 
ing,  that  of  real-time  processing.  Real-time  applications  historically  have  not  attracted  the  atten¬ 
tion  of  researchers  interested  in  the  development  of  tools  and  architectures  suited  to  such  systems. 
Applications  are  predominately  developed  using  assembly  language  because  the  time  cost  of  each 
instruction  in  a  program  is  easily  identified.  However,  once  highly  concurrent  systems  are  avail¬ 
able,  even  the  use  of  assembly  language  will  not  be  sufficient  to  determine,  a  priori,  the  actual 
flow  of  execution  across  time.  New  tools,  from  language  constructs  to  define  time  orders,  con¬ 
straints  and  dependencies  in  applications  to  time  constraint  enforcers  in  future  systems,  are  badly 
needed.  Enough  research  is  now  beginning  that  these  needs  may  also  be  addressed  reasonably 
well  by  the  start  of  the  twenty-first  century. 
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1 .  Summary 


The  aim  is  to  construct  a  powerful  distributed  query  processing  system, 
which  is  suitable  for  local  area  networks,  long-haul  networks  and  mixed 
networks  where  local  area  networks  are  interconnected  by  long  distance  lines. 
The  system  will  be  capable  of  processing  logical  queries  (e.g.,  Prolog)  as 
well  as  ordinary  relational  queries  (e.g.,  Ingres  types  of  queries),  be  able 
to  learn  to  process  queries  more  efficiently  through  being  use  and  be  able  to 
adapt  itself  to  varying  environments  (e.g.,  partial  system  crash 
surviability ) . 

2.  Present  status:  Semi- join  algorithm  and  replicate  algorithm 

Distributed  query  optimization  is  an  important  factor  for  the  performance 
of  a  system  with  databases  distributed  in  a  network.  Many  distributed  query 
processing  algorithms  [ApHY,  BeCh,  BeGo,  CDFG,  ChHo,  BILu,  ChBH ,  Chanl ,  Chan2, 
ChLil ,  ChLi2,  ChLi3,  EpSW,  GoSH,  HeYa,  Luk2,  SDD1 VI ,  SDD2V2 ,  WoKa,  Wong,  WOHL, 
Yaos,  YCTB,  YLCC,  YuOz]  have  been  proposed.  Most  of  them  can  be  classified 
into  two  categories.  The  first  category,  semi'-join  algorithms  [ApHY,  BeCh, 
BeGo,  CDFG,  ChHo,  BILu ,  ChBH,  Chanl,  Chan2,  ChLil,  ChLi2,  ChLi3,  GoSH,  HeYa, 
Luk2,  SDD1V1,  SDD2V2 ,  WoKa,  Wong,  WOHL,  Yaos,  YCTB,  YLCC,  YuOz],  emphasizes  on 
the  reduction  of  the  amount  of  data  transferred  across  sites  in  networks. 
These  algorithms  are  suitable  for  databases  distributed  in  long-haul  networks. 
The  other  type  of  algorithms,  replicate  algorithms  [EpSW,  YGCC,  YGZT,  TBCD], 
emphasizes  on  the  reduction  of  local  processing  cost  by  promoting  parallel 
processing  in  different  sites,  though  canmuni cation  cost  is  taken  into 
consideration.  These  algorithms  are  suitable  for  databases  distributed  in 
local  area  networks.  We  will  briefly  describe  what  we  have  accomplished  in 
these  two  categories.  Our  algorithms  [YCTB,  YGCC,  YGZT]  are  intended  for  ad 
hoc  queries,  unlike  enuraerative  algorithms  ([ChHu,  LMHL,  WOHL])  which  are  more 
suitable  for  repetitive  queries.  Standard  relational  terminology  can  be  found 
in  [Date,  Ullm]. 

2.1.  Semi- join  algorithm 

A  semi- join  from  relation  R^  ^o  relation  Rj  on  attribute  A  [BeCh], 
denoted  by  Rj^  -a->  r  defined  to  be  the  join  af  R^  with  Rj  on  A  projected 

back  onto  the  attributes  of  R,  if  Ri  and  Rj  are  in  different  sites,  the 
semi-join  operation  can  beJ  performed  byJ  projecting  R.  on  a,  sending  the 
resulting  unary  relation  to  the  site  containing  R ^  anCj  then  joining  with  the 
relation  R  on  A<  jn  this  way,  the  amount  of  data  transfer  will  be  smaller 
than  that  of  sending  the  entire  relation  R^#  Furthermore,  the  semi-join 


operation  eliminates  tuples  of  R,  that  do  not  have  a  common  value  with  R; 
projected  on  A.  Thus,  the  operationJ always  reduces  the  right-hand  operand,  so 
that  if  it  is  to  be  transferred  to  another  site,  less  data  transfer  will  be 
required . 

The  algorithm  has  been  reported  in  [YCTB,  YCTBL].  It  is  similar  to  many 
existing  distributed  query  processing  algorithms  [ApHY,  BILu,  Chanl ,  Cnan2, 
ChHo,  ChLil,  ChLi2,  ChLi3,  HeYa,  SDD1  ,  etc.]  in  executing  a  sequence  of  semi¬ 
joins  [BeCh]  to  reduce  the  sizes  of  relations  referenced  by  the  query,  before 
assembling  them  in  a  site  to  obtain  the  answer.  However,  it  has  the  following 
additional  desirable  features:  (i)  The  algorithm  determines  the  relations 
that  need  not  participate  in  later  operations  after  the  execution  of  some 
semi-joins;  as  a  result,  communication  and  local  processing  costs  due  to  the 
processing  of  those  relations  are  saved.  (ii)  All  semi-joins  involving 
relations  identified  in  (i)  can  either  be  discarded  or  replaced  by  better 
semi-joins;  as  a  consequence  of  (i)  and  (ii),  the  algorithm  can  be  shown  to 
yield  better  performance  than  the  SDDr  algorithm  [YCTB1 ,  YCTBL],  (iii)  For 
each  semi-join  involving  horizontally  fragmented  relations,  it  determines  a 
strategy  with  least  communication  cost  to  execute  the  semi-join,  (iv)  It 
makes  use  of  redundant  copies  of  relations  to  reduce  communication  cost  and 
promote  parallelism,  (v)  Semi-joins  are  executed  dynamically  (i.e.  as  soon  as 
a  semi-join  is  planned,  it  is  executed  and  the  number  of  tuples  of  the  reduced 
relation  is  returned)  so  that  the  errors  in  estimating  the  cost  and  the 
benefit  of  executing  the  next  semi-join  are  reduced.  Furthermore,  if 
execution  errors  or  site  failures  are  encountered,  alternate  strategies  may  be 
invoked . 

2.2.  Replicate  algorithm 

The  replicate  algorithm  [YGCC,  YGZT]  makes  use  of  semanatic  information 
so  that  certain  distributed  queries  can  be  processed  locally  without  data 
transfer  with  respect  to  the  join  clauses  of  the  query.  A  simple  algorithm  is 
given  to  recognize  the  "locally  processable  queries".  For  non-local 
processable  queries,  a  simple  "fragment  and  replicate”  approach  [EpSW]  is 
used.  The  approach  chooses  a  relation  to  remain  fragmented  while  replicating 
the  other  relations  referenced  by  the  query  at  the  sites  of  the  fragmented 
relations.  However,  unlike  [EpSW],  our  algorithm  takes  into  considerations 
not  only  the  amount  of  data  processed  and  transferred  but  also  the  presence  or 
absence  of  fast  access  paths  (e.g.  indices),  which  have  a  very  significant 
effect  on  local  processing  cost,  and  different  processing  speed  at  different 
3ites.  If  the  fragments  of  the  chosen  relation  are  not  moved,  then  a  linear 
time  algorithm  that  minimizes  the  response  time  can  be  found;  otherwise,  the 
problem  can  be  shown  to  be  NP-hard  [YGCC],  In  the  latter  situation,  heuristics 
are  suggested.  Experimental  results  given  in  [YGZT]  show  that  the  heuristics 
give  good  approximation.  Details  can  be  found  in  [YGCC,  YGZT]. 

When  all  the  relations  referenced  by  a  query  are  unfragmented,  it  may  be 
desirable  to  partition  a  relation  into  fragments  before  the  replicate  strategy 
is  applied.  An  optimal  algorithm  to  choose  the  partitioned  relation  and  a  set 
of  processing  sites  Is  given  in  [YGBC]. 

3.  Future  directions 


We  believe  that  the  semi-join  algorithm  and  the  replicate  algorithm 
should  be  modified  to  allow  (a)  integration,  (b )  adaptation  and  (c)  be  capable 


of  processing  logical  queries. 

(a )  .  Integration 

Both  replicate  algorithm  and  semi-join  algorithm  suffer  from  being  too 
restrictive  in  the  sense  that  they  are  suitable  for  only  a  particular  network 
environment  (i.e.,  local  area  networks  or  long-haul  networks)  and  do  not  take 
advantage  of  the  features  of  one  another.  It  is  believed  that  a  semi-join 
operation  is  useful  in  reducing  not  only  communi cai ton  costs  but  also  local 
processing  costs  [VaGa],  Furthermore,  if  a  relation  can  be  eliminated  after 
executing  some  semi-joins  lYCTB]  ,  the  chance  of  lowering  further  local 
processing  costs  and  canmunication  costs  is  greatly  enhanced.  We  are 
developing  an  integrated  algorithm  (integrate  semi-join  and  replicate 
algorithms)  [YuGC].  The  feature  of  the  algorithm  is  briefly  described  as 
follows . 

The  concepts  of  cost,  benefit  and  profit  (benefit  -  cost)  of  executing  a 
semi-join  is  also  used  as  in  other  semi-join  algorithms  for  long-haul  networks 
(e.g.,  [ SDD1 ,  YCTB] ,  except  they  are  generalized  in  two  ways.  First,  local 
processing  cost  as  well  as  data  communication  cost  are  incorporated.  Second, 
benefits  and  profits  are  defined  at  each  processing  site  (as  determined  by  the 
replicate  algorithm).  We  also  introduced  the  concept  of  the  relative  profit. 
Let  Res  be  the  response  time  and  Total  be  the  total  cost  (communication  costs 
and  local  processing  costs)  at  a  site  before  performing  a  semi-join.  The 
relative  profit  at  this  site  due  to  performing  a  semi-join  is  defined  to  be 
Res  -  Total  +  profit.  The  smallest  relative  profit  among  those  of  a  semi-join 
at  all  processing  sites  is  defined  to  be  the  minimum  relative  profit  of  the 
semi-ijoin.  If  the  minimum  relative  profit  of  a  semi-join  is  greater  than 
zero,  then  the  response  time  can  be  reduced  by  executing  the  semi-join.  In 
other  words,  for  those  sites  which  have  total  costs  close  to  Res  before 
executing  the  semijoin,  we  need  profits  >  0  to  find  the  semijoin  worthwhile 
executing.  But  for  those  sites  which  have  total  costs  much  less  than  Res, 
though  the  profits  may  be  less  than  0,  the  semijoin  is  still  worth  executing 
as  long  as  its  final  response  time  <  Res.  The  extension  of  the  concept  of  the 
minimum  relative  profit  of  a  semi-join  provides  a  vehicle  to  select  semi¬ 
joins.  The  semi'-join  having  the  largest  minimum  relative  profit  among  all 
possible  semi- joins  is  chosen  first,  then  the  process  is  repeated  to  choose 
the  next  semi-join  having  the  largest  minimum  relative  profit  and  so  on.  In 
this  way,  semi-joins  are  incorporated  into  the  replicate  algorithm. 

The  integrated  algorithm  is  to  be  used  not  only  in  local  area  networks  or 
long-haul  networks  but  also  in  mixed  networks.  If  a  query  can  be  processed 
within  a  single  local  network,  then  the  Integrated  approach  given  above  is 
likely  to  yield  a  good  strategy.  Suppose  that  a  query  has  different  fragments 
spanning  a  number  of  local  networks.  Each  query  fragment  may  be  processed 
using  the  integrated  replicate  algorithm.  Then  the  query  unifying  the 
different  query  fragments  may  be  processed  using  the  semi-join  algorithm. 
Whether  this  is  a  reasonable  approach  has  yet  to  be  investigated. 

(b )  .  Adaptation 

A  powerful  query  optimizer  should  be  able  to  yield  not  only  a  good 
strategy  for  processing  a  query  but  also  adapt  itself  to  varying  environments 
by  acquiring  knowledge  from  executing  previous  queries. 


There  are  a  number  of  situations  in  which  tne  experience  ottainec  in  tr.e 

execution  of  queries  will  be  useful  in  speeding  up  the  execution  of  future 

queries.  Some  of  these  are  given  as  follows. 

(i)  After  executing  a  semi-join  of  the  form  _A„R  f  we  find  that  R.  is 

not  reduced,  then  R  .  (A)  known  to  be  a  subset  of  R ^ ( A ) .  This  allows 

us  to  save  in  the  processing  of  certain  queries  containing  join  clauses 
of  the  form  •  A  -  R  •  A.  More  information  can  be  found  in  [YLGT]. 

(ii)  The  cost  of  executing  a  step  in  each  of  our  two  algorithms  (the 

replicate  and  the  semi-join  algorithms)  is  estimated.  When  the  step  is 

executed,  the  actual  cost  is  returned  due  to  the  dynamic  nature  of  our 
algorithms.  If  there  is  a  significant  difference  between  the  estimated 
cost  and  the  actual  cost,  the  estimated  cost  is  increased  or  decreased 
by  a  fraction  of  the  difference  between  the  two  costs.  As  an  example, 
suppose  the  step  is  to  execute  a  data  transfer  in  a  long-haul  point-to- 
point  network  and  the  actual  cost  is  a  lot  higher  than  the  estimated 
cost.  Then  the  phenomenon  may  indicate  that  the  traffic  between  the  two 
sites  is  congested.  The  estimated  cost  will  be  increased  and  as  a 
result,  data  transfer  between  the  two  sites  in  the  near  future  may  be 
replaced  by  data  transfer  involving  different  sites  if  the  alternative 
is  acceptable  and  cheaper.  Similarly,  if  the  step  is  the  execution  of 
some  local  operations  at  a  site  and  is  found  to  be  a  lot  more  expensive 
than  estimated,  then  the  site  may  be  overloaded.  With  the  estimated 
cost  increased,  equivalent  operations  will  automatically  be  transferred 
to  other  sites  sc  as  to  achieve  load  balancing  [CaLL.Ston].  More 

details  can  be  found  in  [YLGT]. 

(iii)  It  is  possible  that  there  are  seme  situations  in  which  the  algorithms 
may  perform  poorly,  especially  in  a  mixed  network  environment.  Some 
users,  having  certain  local  knowledge  may  be  able  to  provide  strategies 
for  some  queries  with  better  performance  than  our  algorithm.  We  have 
designed  a  graphical  interface  [YLGW]  in  which  users  can  specify  their 
strategies  by  stating  the  sequence  of  semi-joins  to  be  executed,  or  the 
relation  that  is  to  remain  fragmented  and  their  processing  sites,  or 
some  combination  of  the  two.  The  user  strategy  is  executed.  If  the 
actual  cost  of  the  user's  strategy  is  significantly  less  than  the 
estimated  cost  of  the  system's  strategy,  then  the  user  will  be  asked  to 
explain  the  basis  of  his/her  strategy.  The  user's  strategy  will  also  be 
saved  by  the  system  for  later  examination.  In  some  situations,  an 
automatic  analysis  is  performed  and  the  user  is  not  required  to  provide 
the  reasons  for  his/her  strategy.  It  is  hoped  that  knowledge  which  is 
not  exploited  by  our  algorithms  will  be  unveiled  by  either  experienced 
or  lucky  users  and  then  the  knowledge  will  be  acquired  by  our 
al gorithms . 


In  addition,  query  optimization  algorithms  in  highly  available  systems 
(e.g.  [Kim,  BhLi])  must  be  able  to  adapt  to  the  current  status  of  the  sites 
and  communication  lines.  In  particular,  they  must  respond  to  system  failures( 
when  primary  location  contents  are  lost),  media  failures  (  when  volumes  of 
secondary  storage  are  lost),  and  network  partitionings  (  when  two  or  more 
subsets  of  system  sites  lose  the  ability  to  communicate  with  each  other). 
(Communication  line  failures  that  do  not  partition  the  network  can  be  masked 
by  message  re-routing).  When  any  of  these  failures  is  announced  by  the 


database  operating  system,  a  query  optimization  algorithm  must  acjust  its 
execution  strategies  by  excluding  failed  sites,  or  sites  in  foreign  partitions 
from  its  list  of  available  resources.  Depending  on  the  partitioning  policy 
enforced  by  a  given  database  management  system,  it  may  also  be  necessary  to 
cease  accessing  certain  data  (  sometimes  all  data  )  in  a  certain  partition  ( 
sometimes  in  all  but  one  partition  )[BhLi].  After  recovery  from  a  failure, 
announced  by  the  database  operating  system,  the  query  optimization  algorithm 
should  adjust  its  execution  strategies  again. 

(c).  Logic  queries 

There  are  a  number  of  proposals  to  optimize  logical  queries,  i.e., 
queries  are  expressed  in  first-order  logic  as  a  collection  of  Horn  clauses 
(see,  for  example,  [HaLu,  loan,  JaCV,  BMSu,  SaZa,  Nang,  Ullma,  Warr]).  Our 
feeling  is  that  while  the  ideas  presented  are  reasonable,  they  have  to  be 
evaluated  with  realistic  database  systems.  For  example,  the  strategy  in 
[Warr]  suggests  the  rearrangement  of  subgoals  so  that  the  subgoal  to  be 
executed  next  is  the  one  with  least  expected  number  of  alternatives.  It  is 
assumed  that  all  attributes  are  indexed  and  therefore  all  tuple  accesses  are 
equally  costly.  In  realistic  database  environment,  some  attributes  of  certain 
relations  have  fast  access  paths,  while  others  do  not,  as  explained  in  the 
cost  model  of  the  replicate  algoritm  [YGCC,  YGZT].  Thus,  different  tuples  may 
be  accessed  with  different  costs. 

Evaluation  of  a  relation  at  a  time  [Ullma]  instead  of  a  tuple  at  a  time 
is  realistic  in  database  environment.  However,  cost  equations  are  not 
provided,  which  may  prevent  the  construction  of  optimal  or  near^-optimai 
strategies.  [HaLu]  introduced  3  rather  general  algorithms  to  process  linear 
recursive  queries.  However,  the  bounding  of  variables  are  not  considered.  The 
analyses  provided  by  [loan,  .Nang]  are  extremely  useful  but  have  applied  to 
very  restrictive  class  of  queries.  We  believe  that  their  analyses  when 
applied  to  more  general  queries  will  yield  very  useful  results. 

Our  approach  is  to  modify  existing  proposals  (for  example[GaMN ,  HaLu, 
loan,  NaHe,  Nang,  Warr,  Ullma]),  extend  the  research  to  distributed 
environments  and  test  the  performance  in  actual  networks.  Algorithms  will  be 
refined  based  on  experimental  observations. 

4.  Aim 

The  aim  is  to  develop  an  algorithm  that  is  capable  of  processing 
efficiently  relational  and  logical  queries  in  a  general  and  dynamic 
environment.  In  order  to  accomplish  this,  it  is  essential  that  realistic 
experiments  be  performed  on  actual  networks  to  gain  understanding  of  the 
effect  of  different  steps  on  performance. 
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ABSTRACT 


Research  in  the  area  of  concurrency  control  performance 
evaluation  has  extensively  been  addressed  in  the  past  few  years. 
Unfortunately,  conclusions  arrived  to  by  many  researchers  were 
inconsistent  and  at  times  contradictory.  This  inconsistency  has 
arisen  due  to  the  different  assumptions  and  performance  models  used. 
In  order  to  eliminate  this  situation,  a  unified  performance  model  is 
needed.  In  this  paper,  we  propose  a  framework  for  concurrency  control 
performance  evaluation  in  single  site  databases.  Areas  of  performance 
studies  are  classified,  and  a  suggested  performance  model  is 
specified. 


1- INTRODUCTION 

Despite  the  very  large  number  of  concurrency  control  algorithms 
in  both  single  site  and  distributed  databases,  there  does  not  exist 
any  formal  quantitative  method  for  analyzing  and  comparing  their 
performance.  However,  a  number  of  performance  studies  were  conducted 
in  the  last  few  years.  Different  assumptions  and  performance  models 
were  used  in  each  of  these  studies,  thereby  leading  to  incomparable 
and  in  many  cases  contradictory  results.  Therefore,  future 
performance  studies  should  be  based  on  a  unified  framework.  In  this 
paper,  a  framework  for  concurrency  control  performance  evaluation  in 
single  site  databases  is  proposed. 

In  section  2,  an  overview  of  database  systems  and  performance 
studies  is  given,  followed  by  the  proposed  performance  evaluation 
model  in  section  3.  Finally  some  concluding  remarks  are  discussed  in 
section  4. 


2.0-OVERVIEW 


2.1 -DATABASES  AND  CONCURRENCY  CONTROL 


Figure  1  :  Multiuser  Database  System 


A  single  site  database  system  (Figure  1)  consists  of  three 
components  :  the  transaction  component,  the  database  management 
system  (DBMS)  component,  and  the  database  component.  Each  component 
has  a  set  of  logical  and/or  physical  parameters  which  constitute  the 
overall  system  characteristics.  User  transactions  interact  with  the 
DBMS  by  issuing  read  and  write  requests  for  database  items  stored  in 
the  database.  The  concurrent  execution  of  transactions  can  result  in 
data  misuse  (lost  updates  and  dirty  reads  [Bern81]).  Therefore, 
database  systems  include  a  subsystem  called  the  concurrency  controller 
to  control  concurrent  accesses  to  the  shared  database. 

A  concurrency  control  algorithm  (CCA) ,  can  be  designed  using  one 
of  three  approaches  :  Locking  [Mena78,Rose78] ,  Timestamping 
[Reed78,Thom79,Bern81] ,  or  the  Optimistic  approach  [Baye80,Kung81] . 
All  previous  approaches  can  be  implemented  to  allow  for  multiple 
versions  of  a  database  [Reed78,Baye80,Care86] . 


2.2-OVERVIEW  OF  PERFORMANCE  STUDIES 


There  are  five  areas  to  which  performance  studies  should  be 
directed.  In  the  remainder  of  this  section,  each  area  is  explained 
briefly. 

System  Throughput  and  Transaction  Response  Time 

These  are  two  indices  of  special  importance.  Minimizing  the 
average  transaction  response  time  is  of  special  interest  to  the  user, 
whereas  maximizing  the  system  throughput  is  of  special  interest,  to  the 
system  manager.  Most  researchers  have  directed  their  efforts  to  this 
area. 
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Algorithm  Efficiency  and  Asymptotic  Behavior 


These  studies  investigate  the  useless  percentile  of  storage,  CPU, 
I/O,  and  for  the  distributed  case,  intersite  communications.  This  is 
called  algorithm  overhead,  and  it  can  be  viewed  as  cost  factors  which 
affect  the  system  throughput  and  the  transaction  response  time.  Many 
researchers  directed  their  researcn  to  this  area  [Ries77,Ries79] . 

The  Effect  of  Varying  Transaction  Component  Parameters 

The  transaction  component  parameters  affecting  system  throughput, 
transaction  response  time,  and  algorithm  efficiency  are  listed  below  : 

1-  The  number  of  transactions  in  the  system  (load  or  degree  of 
multiprogramming) . 

2-  Transaction  size  in  terms  of  the  number  of  read  and  write 
operations . 

3-  CPU  resource  requirements. 

4-  Read/Write  mix  (ratio  of  query  to  update) . 

5-  Locality  of  requests. 

6-  Read/Write  operation  sequence  (either  interleaved  or  clustered 
into  a  read  step  followed  by  a  write  step) . 

7-  The  time  between  requests. 

8-  Transaction  semantics  (for  formal  definition  of  transaction 
semantics,  see  [Bhar84]). 


Each  of  the  above  listed  parameters  has  an  impact  on  the  overall 
system  performance.  Most  researchers  investigated  1,  2,  and  3  above. 
In  these  studies,  the  degree  of  multiprogramming  ranged  from  5  to  500 
transactions;  small  and  large  transaction  classes  were  used  to 
represent  different  transaction  sizes.  The  effect  of  the  fourth 
parameter  (Read/Write  mix)  was  studied  by  [Hela85,Care86]  and  there  is 
no  reported  study  addressing  the  effect  of  5,  6,  7  and  8  above.  It 
should  be  noted  that  transaction  semantics  do  not  directly  affect  the 
performance;  however,  CCAs  can  be  designed  to  utilize  transaction 
semantics  to  improve  the  performance. 


The  Effect  of  Varying  the  DBMS  Component  Parameters 


The  DBMS  component  parameters  affecting  system  throughput, 
transaction  response  time,  and  algorithm  efficiency  are  listed  below  : 


1-  The  concurrency  control  algorithm. 

2-  The  recovery  technique. 


3-  Other  modules  such  as  :  data  access  method,  security  module, 
and  query  optimization  module. 


The  CCA  has  a  great  impact  on  performance  and  most  of  the 
research  in  this  area  has  been  focused  on  :  two-phase  locking,  basic 
timestamp  ordering,  serial  validation  and  multiversion  algorithms. 
Recovery  techniques  affect  the  way  CCAs  are  designed  and  consequently 
the  overall  system  performance.  Two-phase  commit  is  the  most  widely 
used  technique  in  single  site  databases.  Although  the  effect  of  the 
various  recovery  techniques  can  be  studied  separately,  the  effect  of 
each  technique  on  the  CCA  should  be  investigated.  The  effect  of  the 
two-phase  commit  on  the  dynamic  two-phase  locking  and  the  serial 
validation  algorithm  was  studied  by  [Hela85] . 


The  Effect  of  Varying  the  Database  Component  Parameters 

The  database  component  parameters  affecting  system  throughput, 
transaction  response  time,  and  algorithm  efficiency  are  listed  below  : 

1-  The  data  model  (relational,  hierarchical,  network,  etc  ...) 

2-  The  physical  database  size  (given  in  terms  of  physical  units 
called  items) 

3-  The  granularity  (a  granule  may  contain  more  than  one  item) 


Like  transaction  semantics,  the  data  model  has  no  direct  effect 
on  the  performance,  but  a  CCA  can  be  designed  to  utilize  a  specific 
data  model  for  better  performance.  Granularity  is  a  critical 
parameter  which  dramatically  affects  the  overall  system  performance, 
especially  when  the  system  is  loaded.  Ries  and  Stonebraker  [Ries77, 
Ries79]  have  investigated  the  effect  of  "locking"  granularity  on  the 
performance  of  DBMS.  However,  granularity  should  also  be  studied  for 
non-locking  algorithms. 


3-THE  PROPOSED  PERFORMANCE'  EVALUATION  MODEL 


In  this  section,  a  performance  model  to  be  used  in  evaluating  the 
performance  of  concurrency  control  algorithms  is  given.  The  model 
consists  of  four  components  :  the  transaction  workload  description, 
the  system  structure  and  parameters,  the  concurrency  control 
algorithm,  and  the  performance  indices.  In  the  remainder  of  this 
section,  these  components  are  explained  first  and  then  a  set  of  model 
assumptions  are  stated. 


(a)  Transaction  Workload  Description 

Transaction  workload  can  be  specified  using  the  following 
parameters  : 

1-  The  number  of  transactions  in  the  system  (degree  of 
multiprogramming)  or  in  case  of  open  system  modeling,  the  average 
arrival  rate  and  the  distribution  of  the  arrival  process  (usually 
assumed  to  be  poisson) . 

2-  Transaction  size  which  is  the  number  of  the  read  and  write 
operations  in  a  transaction.  This  can  either  be  deterministic 
(fixed  size  with  the  possibility  of  multiple  transaction  classes) 
or  probabilistic  (usually  uniformly  distributed  over 
[min_size . . . maxsize] ) . 

3-  The  amount  of  CPU  time  required  by  a  transaction.  This  can 
either  be  neglected  or  assumed  to  be  exponentially  distributed  with 
some  mean . 

4-  The  Read/Write  mix  which  can  either  be  deterministic 
(e.g.  80%  read  and  20%  write)  or  probabilistic  (e.g.  prob 
{next  request  is  read}  >  0.2).  It  is  possible  to  have  different 
transaction  classes  that  range  from  pure  query  to  heavy  update. 

5-  The  Read/Write  operations  sequence.  It  can  be  either 
interleaved  or  clustered. 

6-  The  distribution  of  the  time  between  access  requests.  This 
is  not  known  to  have  a  certain  distribution  and  can  be  assumed 
uniform. 

7-  Transaction  semantics.  They  can  be  described  by  redefining 
the  read  and  write  operations  as  pairs  (  P, Operation  )  where  P  is 
some  predicate  that  when  evaluated  True,  the  R/W  operation  would  be 
performed.  We  give  no  specification  for  the  predicate  P. 


(b)  The  System  Structure 

The  system  structure  is  depicted  in  Figure  2.  This  structure  can 
be  used  in  modeling  both  open  and  closed  systems.  When  a  new 
transaction  arrives,  its  workspace  is  initialized  and  it  is  put  on  the 
concurrency  controller  queue  ,CCQ.  The  concurrency  controller,  CC 
then  serves  the  transaction  by  :  (i)  allowing  it  to  the  database, 
(ii)  blocking  it  in  the  block  queue,  BQ,  ( i i i )  aborting  it,  or 
(iv)  committing  it.  In  the  case  of  a  non-blocking  CCA,  the  BQ  is  not 
used.  If  the  transaction  must  wait  for  an  unavailable  resource,  it  is 
put  on  the  BQ.  When  the  resource  is  available,  it  is  dequeued  and  is 
put  back  on  the  CCQ.  Once  allowed  into  the  system,  a  transaction  with 
a  read  request  is  placed  on  the  query  queue,  QQ.  However,  in  case  of 
a  write  request,  no  disk  write  operation  takes  place  and  the  new  value 
is  updated  only  in  its  workspace.  When  a  transaction  finishes,  the  CC 
puts  it  on  the  update  queue,  UQ  so  that  all  its  updates  may  be 
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Not  yet  complete 


Figure  2  :  System  Structure 


committed.  The  UQ  has  higher  priority  than  the  QQ.  If  a  transaction 
is  aborted,  the  CC  resets  or  modifies  its  workspace,  possibly  delays 
it,  and  then  resubmits  it  to  the  system  through  the  transaction 
initiator,  TI .  When  a  new  transaction  arrives  at  the  TI,  the  later 
submits  a  small  setup  transaction  with  zero  or  one  read  operation. 
This  setup  transaction  is  guaranteed  to  commit  with  no  restarts  and  it 
represents  a  setup  cost  for  newly  arriving  or  restarted  transactions. 
When  the  setup  transaction  commits,  the  original  transaction  is 
considered  by  the  CC. 


(c)  The  Concurrency  Control  Algorithm 

The  concurrency  control  algorithm  is  specified  by  a  piece  of 
actual  code.  Since  this  code  is  sharable  by  different  transactions, 
it  is  considered  as  a  critical  section  and  is  modeled  by  a  single 
server  queuing  model  with  FIFO  discipline  and  with  service  time  found 
by  one  of  two  methods.  First,  a  counter  can  be  used  to  count  the 
number  of  actually  executed  statements  while  calling  this  code.  This 
exact  count  gives  an  exact  service  time  for  the  transaction  being 
served.  This  exact  service  time  can  then  be  used  as  a  predicted 
service  time  for  the  next  transaction  to  be  served.  Second,  using  the 
first  method,  simulation  can  be  conducted  to  compute  the  average 
service  rate  which  can  then  be  used  in  subsequent  experiments. 


(dl  The  Performance  Indices 

Following  are  some  general  indices  that  should  be  measured  in  any 
performance  study. 

1-  Average  transaction  response  time. 

2-  System  throughput. 

3-  Average  degree  of  concurrency  (DC) 

4  Conflict  rate. 

5-  Maximum  number  of  times  a  transaction  was  restarted. 


6-  Useless  I/O  and  CPU  percentile. 


There  are  also  certain  indices  applicable  only  to  specific  CCAs. 
An  example  is  the  deadlock  rate  in  the  two-phase  locking  algorithm. 
To  define  the  degree  of  concurrency,  DC,  let  Tn  be  the  duration  of 
time  through  which  the  number  of  active  transactions  in  the  system  is 
n.  The  degree  of  concurrency  is  then  given  by  : 

N 

DC  =  1/  (NTs)  *  LnTn 

Where  Ts  is  the  system  time  at  which  the  DC  is  measured  and  N  is  the 
number  of  transactions  in  the  system.  Clearly,  0<DC<1. 

MODEL  ASSUMPTIONS 

1-  Two-phase  commit  is  incorporated  with  the  CCA  used.  Thus  a 
write  operation  is  treated  as  a  non-disk  operation  and  is  committed  on 
transaction  completion.  Two-phase  commit  begins  at  transaction 
completion  by  issuing  1  +  |writeset|  disk  operations  (one  in  the  first 
phase  and  |writeset|  in  the  second  phase). 

2-  Resources  are  finite  :  the  physical  database  is  stored  in  one 
disk  unit  with  some  service  rate.  The  case  of  more  than  one  disk  unit 
can  equivalently  be  studied  by  one  disk  unit  with  a  higher  service 
rate.  Also,  there  is  only  one  CPU  with  some  speed  in  the  system. 

3-  Elements  of  the  readset  and  the  writeset  of  a  transaction  are 
distinct. 

4-  A  transaction  workspace  consists  of  two  parts  :  the 

transaction  definition  and  the  transaction  data  section.  When  a 
transaction  is  restarted,  only  its  data  section  is  reset. 


4-CONCLUSION 

In  this  paper,  we  have  classified  the  research  studies  of 
concurrency  control  performance  evaluation  into  five  major  areas, 
three  of  which  reflect  the  effect  of  the  database  environment  on  the 
performance.  Also,  we  proposed  a  framework  for  performing  these 
studies  using  a  unified  performance  model. 
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Abstract 

Distributed  database  systems  are  of  utmost  importance  to  military  exercises 
because  such  applications  are  inherently  distributed  and  require  severe  performance 
constraints  such  as  fast  response  time  and  continued  operation  in  face  of  catastrophic 
failures.  In  this  talk,  we  examine  several  issues  in  the  design  of  a  high  performance 
distributed  database  system.  In  particular,  we  discuss  the  problem  of  concurrency 
control  which  deals  with  controlling  the  concurrent  access  of  a  database  by  simultane¬ 
ously  running  transactions  such  that  the  correctness  of  the  database  is  maintained, 
and  the  problem  of  reliability  which  deals  with  the  resiliency  to  different  kinds 
failures.  We  examine  the  state  of  the  art  of  these  design  issues  and  discuss  future 
directions  for  distributed  database  systems. 


1.  Introduction 

A  database  is  a  collection  of  related  data  objects,  which  is  shared  by  several  users  with 
potentially  diverse  interests.  The  data  objects  of  a  database  must  usually  satisfy  certain 
semantic  relationships,  referred  to  as  the  consistency  assertions  of  the  database  [22].  A  data¬ 
base  is  consistent  if  values  of  its  data  objects  satisfy  all  of  its  consistency  assertions. 

The  falling  cost  of  hardware  and  advances  in  communication  technology  over  the  last 
decade  have  made  distributed  computing  viable  and  have  triggered  interest  in  distributed 
database  systems  (e.g.,[7,  45,  48]).  In  such  systems,  data  objects  are  spread  over  a  collection 
of  computers  which  are  connected  via  some  communication  network.  A  distributed  data¬ 
base  system  offers  several  advantages  over  a  single-site  database  system  [l],  such  as  data  and 
program  sharing,  higher  system  throughput,  higher  system  availability,  load  sharing,  and 
easy  expandability. 

In  a  replicated  database  system,  all  the  data  objects  of  a  database  are  duplicated  at 
every  site.  Replication  of  data  objects  offers  several  attractive  features:  enhanced  reliability, 
improved  responsiveness,  no  directory  management,  and  easier  load  balancing.  Enhanced 
reliability  results  because  a  site  crash  or  network  partitioning  does  not  prevent  access  to 
some  data  objects  and.  in  general,  does  not  stop  transaction  processing  under  these  failures. 
Because  of  the  enhanced  reliability,  replicated  database  systems  are  highly  desirable  in  hos¬ 
tile  environment  where  site  crashes  or  network  partitionings  are  inevitable  and  continued 
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operation  under  these  failures  is  very  crucial.  These  features  have  led  to  several  commercial 
efforts  in  the  direction  of  replicated  database  systems,  e.g..  SDD-1  [b]  arid  distributed 
INGRES  [45j . 

However,  a  replicated  database  system  must  guarantee  that  all  the  copies  of  a  data 
object  agree  with  each  other.  This  additional  constraint  is  reflected  as  a  consistency  asser¬ 
tion.  As  a  consequence,  in  a  replicated  database  system,  the  concept  of  consistency  has  two 
parts  [46],  internal  consistency  and  mutual  consistency.  Internal  consistency  deals  with  i  he 
relationships  of  data  objects  within  a  database  copy.  Mutual  consistency  requires  that  all 
database  copies  must  be  identical. 

A  user  interacts  with  a  database  by  performing  read  and  write  actions  on  the  data 
objects.  The  actions  of  a  user  are  normally  grouped  together  (as  a  program)  to  form  a  single 
logical  unit  of  interaction  which  is  referred  to  as  a  transaction.  To  enhance  the  efficiency, 
the  actions  from  several  transactions  may  be  executed  in  an  interleaved  manner;  however, 
for  correctness,  the  system  must  behave  as  if  each  transaction  is  processed  atomically,  i.e..  as 
an  indivisible  unit.  There  are  two  kinds  of  transaction  atomicities:  concurrency  atomicity, 
and  failure  atomicity.  Concurrency  atomicity  enforces  that  concurrent  execution  of  transac¬ 
tions  must  behave  as  if  transactions  are  executed  serially.  Failure  atomicity  enforces  all  or 
none  property  of  the  execution  of  a  transaction  in  presence  of  failures. 

There  are  several  issues  in  distributed  database  systems  which  must  be  efficiently 
resolved[37j.  For  examples,  concurrency  control  [ o .  2 ‘2 ] .  deadlocks; If).  31];  directory  manage- 
ment[l6j;  optimal  program  and  data  allocation{3-fj;  distributed  transaction  process- 
ing[l7,30|;  atomic  commit[28,  32);  and  crash  recovery[28.  17] .  In  this  talk,  we  limit  our  dis¬ 
cussion  to  concurrency  control  and  reliability  aspects  of  a  distributed  database. 

2.  Concurrency  Control 

Typically,  in  a  database  system,  several  users  concurrently  access  the  database  by  exe¬ 
cuting  transactions.  For  efficiency  reason,  the  system  executes  the  actions  from  several  tran¬ 
sactions  in  an  interleaved  manner.  Since  the  actions  of  concurrently  running  transactions 
may  access  the  same  data  objects,  if  interleaving  of  the  actions  is  not  controlled  in  some  ord¬ 
erly  way,  several  anomalous  situations  may  arise.  For  examples,  lost  updates,  incorrect 
retrieval,  and  inconsistent  update.  As  a  result,  some  transactions  may  see  an  inconsistent 
state  of  a  database,  and  a  database  may  terminate  in  an  inconsistent  state.  This  fundamen¬ 
tal  problem  is  referred  to  as  concurrency  control.  In  a  database  system,  this  problem  is  han¬ 
dled  by  a  concurrency  control  algorithm  which  controls  the  relative  order  (or  interleaving!  of 
conflicting  actions,  so  that  the  consistency  of  a  database  is  preserved. 

Most  concurrency  control  algorithms  are  based  on  locking  or  timestamps.  In  lorkiny 
based  algorithms,  a  transaction  must  lock  a  data  object  before  accessing  it  ■_>">  \  transac¬ 

tion  can  lock  a  data  object  if  it  is  not  already  locked  by  some  other  transaction.  In  times¬ 
tamp  based  algorithms,  every  site  maintains  a  logical  clock  which  is  incremented  by  one 
when  a  transaction  arrives  at  that  site  and  updated  whenever  the  site  re<ei\e-  a  me-s.ige 
with  higher  (dock  value  (every  message  contains  current  (  lock  value  of  its  -ender  site)  K,n  b 
transaction  is  assigned  a  unique  timestamp  and  conflicting  actions  ate  interleaved  in  order  of 

Two  actions  conflict  if  they  operate  on  the  same  data  object,  md  at  least  one  of  them  is  a  wri'e  iction. 


the  timestamp  of  their  transactions. 

2.1.  Current  Status 

Several  concurrency  control  algorithms  have  been  proposed  for  replicated  database  sy,- 
tems.  Depending  upon  whether  synchronization  is  performed  before,  or  after  an  update 
makes  an  access  to  the  data  objects,  these  algorithms  can  be  divided  into  two  classes:  pes¬ 
simistic,  and  optimistic  [IS].  Algorithms  that  perform  synchronization  before  accessing  data 
objects  are  referred  to  as  pessimistic  algorithms.  Algorithms  that  perform  synchronization 
after  accessing  data  objects  are  referred  to  as  optimistic  algorithms. 

Pessimistic  Algorithms 

In  pessimistic  algorithms,  a  site  executes  an  update  after  carrying  out  some  irittt~i'<- 
communication  for  synchronization.  Depending  upon  the  way  intersite  communication  and 
synchronization  are  performed,  several  pessimistic  algorithms  have  been  proposed  in  the 
literature,  e.g..[2, 6, 8.  21. 25,  .52.  25.  29.  45). 

Optimistic  Algorithms 

In  optimistic  algorithms,  an  update  is  executed  concurrently  with  other  updates 
without  performing  any  synchronization.  However,  before  its  computed  values  are  written 
into  the  database,  some  intersite  communication  is  carried  out  to  determine  if  the  update 
has  conflicted  with  any  concurrently  executing  updates.  In  case  of  conflicts,  the  lower  prior¬ 
ity  update  is  aborted,  else  its  computed  values  are  written  into  the  database.  Depending 
upon  the  way  intersite  communication  and  check  for  conllict  are  carried  out.  M*ver.il 
optimistic  algorithms  have  been  proposed  in  the  literature.  e.g.. [15.  1*.  27.  25.  16] . 

Performance 

The  performance  of  concurrency  control  algorithms  for  replicated  database  s\  't<  im  h.i- 
been  studied  using  simulation  techniques  (e.g..  [27.  15,25.  12j),  as  well  as  analytic  techniques 
(e.g.,  [26.42]).  The  performance  of  a  concurrency  control  algorithm  is  usually  me.i-ured  In 
update  response  time  which  is  defined  as  the  time  interval  between  the  instant  when  a  user 
submitted  an  update  at  a  site  and  the  instant  when  the  update  is  completely  executed  at 
that  site,  and  system  throughput  which  is  the  number  of  transactions  completed  per  unit  of 
time  in  a  system. 

In  pessimistic  algorithms,  a  transaction  locks  out  (i.e..  blocks)  all  the  concurrent 
conflicting  transactions.  If  conflicts  among  transactions  are  frequent,  then  the  performance 
of  these  algorithms  is  limited  due  to  large  blocking  delays.  In  optimistic  algorithms,  a 
conflict  causes  a  abort  and  restart  of  a  transaction  Kach  abort  and  restart  increases  the 
response  time  of  the  update  and  wastes  computing  and  communication  resource-..  I  he  per¬ 
formance  of  optimistic  algorithms  is  limited  because  frequent  conflicts  among  updates  induce 
repeated  aborts  and  restarts. 


2.2.  Future  Directions 


High  Performance  Algorithms 

Since  concurrency  control  is  a  fundamental  problem  in  database  systems,  a  high  perfor¬ 
mance  database  system  calls  for  a  high  performance  (i.e..  short  response  time  and/or  higher 
throughput)  concurrency  control  algorithms.  The  performance  of  the  most  of  the  current 
concurrency  control  algorithms  is  limited  by  the  blocking  delays  due  to  conflicts.  Therefore, 
development  of  high  performance  concurrency  control  algorithms  requires-  reducing  the 
blocking  delays  due  to  conflicts.  Since  conflicts  among  transactions  are  inherent  to  a  system, 
we  require  radically  different  techniques  for  concurrency  control  (where  the  performance  is 
not  limited  by  blocking  delays  due  to  conflicts). 

We  have  developed  a  new  technique  for  concurrency  control  in  replicated  database  sys¬ 
tems  [41],  where  the  blocking  delays  due  to  conflict  depend  upon  the  speeds  of  disk  (I/O  dev¬ 
ice)  and  cpu  rather  than  the  speed  of  communication  network  (as  in  existing  algorithms). 
Since  disk  and  cpu  are  usually  much  faster  than  the  communication  medium,  the  new 
approach  exhibits  a  substantial  improvement  in  the  performance.  We  conducted  a  perfor¬ 
mance  study  [42],  which  shows  that  the  new  technique  has  much  better  update  response  time 
and  throughput  characteristics  as  compared  to  existing  algorithms  (unless  disk  is  very  slow 
and/or  message  propagation  delay  is  small). 

An  interesting  finding  of  the  study  is  that  the  maximum  throughput  of  the  centralized 
locking  algorithm  [25].  and  Vlilenkovic’s  pessimistic  algorithm  [35]  (for  that  matter  of  all 
existing  algorithms)  is  limited  because  updates  inflict  large  blocking  delays  due  to  conflicts. 
On  the  contrary,  the  maximum  throughput  of  the  new  technique  is  limited  by  the  disk 
throughput.  Since  disk  speed  can  be  readily  controlled  by  using  a  disk  of  higher  speed  or  by 
using  more  disks,  the  performance  of  this  technique  can  be  easily  controlled.  Whereas,  the 
performance  of  the  existing  algorithms  is  limited  by  large  blocking  delays  due  to  conflicts 
among  updates.  Since  conflicts  among  updates  are  inherent  to  a  system  and  are  difficult  to 
control,  the  performance  of  these  algorithms  is  difficult  to  control. 

Performance  Analysis 

Although  many  concurrency  control  algorithms  have  been  proposed,  little  progress  has 
been  made  in  the  direction  of  their  performance  analysis.  Queueing  models  of  concurrency 
control  algorithms  for  database  systems  have  two  distinct  features,  viz.  multiple  resource 
possession  and  blocking  (l4j,  which  are  absent  in  the  queueing  models  of  conventional  sys¬ 
tems.  Because  of  these  features,  queueing  models  of  concurrency  control  algorithms  are  not 
amenable  to  the  product  form  solutions  [ij,  and  which  makes  the  task  of  the  performance 
analysis  of  concurrency  control  algorithms  very  difficult.  Since  performance  analysis  of  con¬ 
currency  control  algorithms  for  distributed  database  systems  is  very  important,  there  is  a 
desperate  need  to  develop  new  tools  and  techniques  to  cope  with  the  complexities  of  con¬ 
currency  control  algorithms.  Past  methods  for  performance  analysis  of  concurremw  control 
algorithms  [26,  10)  have  delt  with  this  complexity  by  relying  upon  the  -o  called  approxima¬ 
tion  techniques  [l4j- 
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3.  Reliability  Issues 

Reliability  is  highly  desirable  in  military  exercises  where  continued  operation  under 
catastrophic  failures  is  very  crucial.  Reliability  management  encompasses  several  issues  -uch 
as  atomic  commit  [32,44],  site  recovery  [3,47],  network  partitioning  [11.20],  and  fault  toler¬ 
ance  [9]. 

Atomic  Commit 

In  a  distributed  database  system,  a  transaction  is  committed  atomically  if  it  i-  pro¬ 
cessed  at  all  sites  or  at  none.  Designing  an  atomic-commit  protocol  which  is  resilient  to  arh:- 
trary  site  failures  and  network  partitionings  is  an  extremely  difficult  task.  The  two-pha-e 
commit  protocol  [28.32]  achieves  the  atomic  commit  in  the  environment  where  sites  can  uni¬ 
laterally  abort  a  transaction.  However,  it  is  vulnerable  to  site  failures  as  a  site  failure  may 
block  a  transaction.  In  [29].  the  two-phase  protocol  is  extended  to  four-phase  protocol  where 
in  case  of  a  site  failure,  a  backup  site  takes  over.  In  ill.  13].  nonblocking  protocols  ha\  e  been 
examined  where  a  site  never  blocks  a  transaction  because  of  some  failures. 

Crash  Recovery 

When  a  site  recovers  from  a  failure,  the  state  of  its  database  must  be  restored  to  a  con¬ 
sistent  state.  The  techniques  used  to  restore  the  state  of  a  database  to  a  consistent  state  are 
called  recovery  techniques.  Most  recovery  techniques  retain  some  redundant  information 
about  the  database  or  history  of  transaction  execution,  and  recover  a  site  by  restoring  it- 
database  to  a  previous  consistent  state,  borne  exatnples  of  redundant  information  are  audit- 
trail  or  log  [10,  d8j,  checkpoint  [2dJ.  differential  files  [ldj.  The  recovery  in  S|)l)-I  is  based  on 
the  notion  of  spoolers  (d9j,  where  when  a  site  fails,  all  messages  directed  to  it  are  buffered  at 
its  spooler  sites.  When  a  site  recovers,  it  unspools  all  messages  buffered  for  it  from  its  spooler 
sites.  In  DDM  [12,  13],  data  availability  is  improved  by  using  incremental  recovery  f.fj.  and 
intersite  data  transfer  in  case  of  failures  is  reduced  by  using  log-based  recovery  mechanism. 

Network  Partitioning 

If  transetion  processing  continues  in  presence  of  a  network  partitioning,  the  state  of  the 
database  of  each  partition  may  diverge  with  time  and  the  consistency  of  the  database  may 
be  violated.  Therefore,  when  a  database  system  recovers  from  a  partitioning,  the  -tale  of  'he 
database  in  each  partition  must  be  reconciled  to  a  common  value.  R  very  site  keep-  log  or 
audit  trail,  and  when  a  database  recovers  a  partitioning,  databases  in  different  partitions  ire 
reconciled  by  merging  the  journals  of  different  partitions  to  get  a  common  journal  and 
applying  it  to  all  the  sites  of  t  fie  partitions  being  merged.  Techniques  for  run-i-teuik 
integrating  partitions  of  a  replicate*!  database  sy-tem  are  described  in  I  I.  JO1. 

Fault  Tolerance 

A  system  is  fault  tolerant  if  in  presence  of  certain  faults,  it  automatic  alk  defect-. 
lates.  and  recovers  from  them  to  avoid  a  failure.  SRQl  OI.\[9  is  a  t  ight  lv  -•  oupled  i. 
tolerant  multi-processor  transaction  processing  system  winch  u-e-  hardware  appma>  n  to 
fault-tolerance.  It  employs  three  kinds  of  mechanisms:  II)  error  det>  rt mg  cod'-  while  da'a 
is  stored  in  a  memory  or  while  being  transferred  between  different  module-.  'ta\  an 
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protected  by  error-detecting  codes;  (2)  duplication  and  comparison  -  some  functions  such  as 
address  generation  and  cache  management  are  duplicated  in  hardware.  These  components 
perform  operations  independently  and  cross  check  their  results  with  those  produced  by  their 
duplicates;  and  (3)  protocol  monitoring  -  prevents  starvation  due  to  malfunctioning  of  a 
components  by  detecting  violations  in  the  sequence  and  timing  of  inter-element  communica¬ 
tion. 

Future  Directions 

Most  existing  reliability  mechanisms  block  transaction  processing  or  perform  major 
reconfiguration  in  case  of  a  failure  which  may  degrade  the  performance  in  the  presence  of 
failures.  This  may  be  unacceptable  in  hostile  environments  because  high  performance  (avai¬ 
lability)  is  most  desired  when  some  attack  is  underway.  One  approach  to  handle  this  prob¬ 
lem  is  to  sacrifice  correctness  (consistency)  of  a  system  to  obtain  high  availability  and  fast 
response.  Some  work  in  this  direction  is  reported  in  [24,36].  More  work  still  need  be  done  in 
the  direction  of  formalization  of  the  tradeoffs  between  the  correctness  and  the  performance, 
and  new  techniques  need  be  developed  to  achieve  these  tradeoffs. 
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ABSTRACT 

New  software  development  tools  proposed  for  supercomputers  are  based  on  assertive  programming 
where  a  program  is  expressed  as  a  set  of  assertions  about  its  properties  and  not  as  a  sequence  of  steps 
leading  to  the  solution.  Solving  procedures  are  automatically  generated  from  assertive  description. 

Assertive  programming  for  scientific  parallel  processing  t-  supported  by  equational  languages  in 
which  assertions  are  restricted  to  conditional  equations  Such  languages  are  naturally  suiied  for 
mathematically  oriented  compulations.  Equational  languages  also  proved  to  be  an  effective  tool  for 
describing  genera)  computational  tasks.  Descriptions  of  many  parallel  algorithms  are  greatly  simplified 
when  written  in  that  form 

The  MODEL  equational  language  translator  and  its  three  cooperating  components  compiler,  confi¬ 
gurator.  and  timer  are  discussed  in  this  paper  The  focus  is  on  the  application  of  predicate  analysis  and 
automatic  theorem  proving  in  area  of  arithmetic  inequalities  to  checking  equational  programs  and  optimi¬ 
zation  of  generated  parallel  code 

1.  Introduction 

Software  development  for  parallel  and  vector  computers  is  intrinsically  more  diffi¬ 
cult  than  for  sequential  machines.  In  addition  to  the  usual  challenges,  one  must  con¬ 
sider  communications  overhead  and  synchronization  problems.  Often  performance 
evaluation  is  necessary  to  balance  the  computational  load  assigned  to  processors  of  a 
parallel  computer.  However,  time  performance  can  be  evaluated  only  after  the  software 
has  been  developed  and.  therefore,  leads  to  costly  post-development  tuning.  Specialized 
parallel  computing  hardware  often  requires  software  to  be  developed  in  dialects  of  stan¬ 
dard  languages.  In  particular,  varying  communication  primitives  of  different  parallel 
computers  lead  to  major  software  incompatibility  problems.  A  focus  on  software  effi¬ 
ciency  additionally  contributes  to  the  high  costs  of  software  development.  Distributed 
environments  today  and  massive  parallel  computers  of  the  future  force  the  research  com¬ 
munity  to  seek  novel  approaches  to  parallel  programming. 

Many  new  approaches  are  based  on  assertive  programming.  In  this  paradigm,  a 
computational  problem  is  expressed  as  a  set  of  assertions  about  its  properties  and  not 
as  a  sequence  of  steps  leading  to  the  solution  '-olution  procedures  are  automatically 
generated  from  the  assertive  description  I'sers  are  not  involved  in  the  implementation, 
whose  efficiency  and  correctness  are  assured  by  the  underlying  language  processor. 


Depending  on  the  type  of  assertions  used  as  a  basis  for  notation,  different  languages 
for  assertive  programming  have  been  proposed.  Perhaps  the  best  known  is  Prolog,  in 
which  assertions  are  expressed  as  Horn  clauses.  Automatic  inference  of  new  facts  from 
the  given  rules  and  known  facts  makes  it  an  ideal  programming  tool  for  artificial  intelli¬ 
gence  and  expert  systems.  However,  in  an  important  area  of  scientific  computing,  par¬ 
ticularly  amenable  to  parallel  processing  due  to  regularity  of  problems  and  vast  required 
computations,  its  usefulness  is  limited  because  of  difficulty  of  expressing  numerical  algo¬ 
rithms  in  Prolog. 

Another  important  paradigm  of  assertive  programming  is  based  on  equational 
languages,  where  assertions  are  expressed  as  mathematical  equations  (cf.  Ashcroft  and 
VV adge  (1977).  Hoffman  and  O’Donnell  (1982).  and  Prvwes  and  Szymanski  (1985)). 
Such  languages  are  naturally  suited  for  mathematically  oriented  computations.  They 
are  especially  convenient  for  solving  systems  of  linear  equations  that  may  arise  directly 
(as.  e.g.,  in  econometric  modeling)  or  as  a  result  of  a  discrete  approximation  of  a  system 
of  differential  equations.  Equational  languages  have  also  been  proven  to  be  an  effective 
tool  for  describing  general  computational  tasks  (cf.  Baron  et  al.  (1985)). 

Ail  equational  language  can  simplify  programming,  but  it  puts  additional  burden 
on  a  compiler.  For  example,  flow  of  control  has  to  be  entirely  defined  by  the  compiler. 
Each  variable  can  assume  only  a  single  value  yielded  by  the  defining  equation.  Redefini¬ 
tions  of  a  variable  value,  in  the  sense  of  procedural  programming,  have  to  be 
represented  by  a  vector  of  values,  i.e..  a  new  variable  with  one  dimension  more  than 
redefined  one.  Such  additional  dimension,  if  not  recognized  automatically,  leads  to 
unnecessary  copying  of  redefined  structures  and  low  efficiency  of  generated  object  code 
(cf.  Szymanski  and  Prywes  (1986)).  Therefore  compiler  of  an  equational  language  has  to 
perform  global  and  comprehensive  analysis  of  specifications.  Such  overhead  can  be 
avoided  if  user  is  allowed  to  define  flow  of  control,  like  in  procedural  languages  for 
sequential  processing.  For  parallel  processing  however,  user's  defined  control  is  rearlv 
optimal  because  of  the  complexity  of  the  task.  Therefore,  a  comprehensive  analysis  has 
to  be  performed  anyway.  A  procedural  program  in  this  case  is  overspecified,  increasing 
the  chances  of  the  user  to  introduce  errors  and  limiting  compiler  in  optimization  because 
of  unnecessary  dependencies  imposed  by  the  flow  of  control  (particularly  nesting  and 
scope  of  loops  in  iterative,  numerical  computations).  Clearly,  a  nonprocedural,  defini¬ 
tional  specification  is  more  convenient  for  parallel  processing  than  procedural  one. 

Compilers  for  equational  languages  have  to  include  tools  for  static  analysis  and 
optimization  of  generated  code.  Required  analytic  power  is  beyond  the  scope  of  tradi¬ 
tional  compiler  optimization  and  flow  of  control  analysis.  Moreover  the  object  code  is 
generated  in  a  high-level  language  and  therefore  it  is  exposed  to  these  methods  anywav. 
Therefore  in  designing  equational  language  compilers,  there  is  a  need  to  concentrate  on 
problems  specific  to  nonprocedural  specifications.  Among  the  most  important  are:  find¬ 
ing  the  optimal  sequence  of  object  program  events,  reducing  the  dimensionality  of  data 
structures,  recognizing  circular  dependencies  to  be  solved  as  simultaneous  equations,  and 
assessing  consistency  of  specifications.  Many  of  this  problems  were  already  addressed  in 


the  MODEL  compiler  (cf.  Szymanski  and  Prvwes  (1986)).  A  new  approach  to  predicate 
analysis  is  discussed  in  this  paper. 

The  paper  focuses  on  practical  application  of  theorem  proving  techniques  in  design¬ 
ing  compilers  for  future  software  systems.  It  also  describes  how  the  use  of  an  equational 
language  and  translator  can  simplify  parallel  programming.  The  fundamental  notion 
underlying  this  approach  is  that  prospective  users  have  the  knowledge  and  expertise  in 
the  application  domain  and  not  in  computer  programming.  Users  specify  a  problem 
progressively,  by  providing  its  mathematical  model  with  little  regard  for  how  to  imple¬ 
ment  its  solution.  This  is  done  through  composing  equations  that  express  the  underly¬ 
ing  rules,  relations  and  concepts  of  the  problem  environment  and  not  the  solution.  The 
user  composes  the  specification  cooperatively  with  an  automatic  system  which  performs 
two  tasks.  First,  it  verifies  completeness,  non-ambiguity  and  consistency  of  the  user’s 
input  and  solicits  corrections  and  amplifications  until  it  is  satisfied  with  the  comprehen¬ 
siveness  of  the  specification.  Second,  it  automatically  designs,  programs,  and  schedules 
the  corresponding  parallel  computation.  The  automatic  system  also  provides  timing 
information  and  reports  its  design  decisions. 

2.  Description  of  the  MODEL  Equational  Language  and  System 

The  discussion  in  the  paper  is  based  on  our  experience  with  the  MODEL  equational 
language  and  its  translator  (Tseng  et  al.  (1980)).  The  MODEL  language  provides  the 
user  with  a  static  view  of  computations.  It  shields  users  from  concerns  about  flow  of 
control  or  schedule  of  program  execution  events.  User’s  specifications  are  free  from 
implementation  details.  Thus  they  are  much  shorter  (about  3  to  5  times)  and  easier  to 
comprehend  than  the  corresponding  procedural  programs.  The  MODEL  translator  gen¬ 
erates  efficient,  machine  independent  object  programs.  It  can  be  easily  adapted  for 
cross  generation  of  code  for  specialized  hardware.  It  provides  global  checking  and  error 
recovery  procedures.  Owing  to  that  software  development  with  the  MODEL  language' 
proved  to  be  much  faster  then  with  procedural  languages  (cf.  Szymanski  et  al.  (1984)). 
In  recent  years,  the  research  on  MODEL  focused  on  efficiency  of  the  generated  programs 
(cf.  Szymanski  et  al.  (1980))  and  on  creating  programming  tools  for  parallel/distributed 
environments  (cf.  Prywes  and  Szymanski  (1985)  and  Szymanski  et  al.  (1985)). 

The  MODEL  system  consists  of  three  components:  (i)  the  compiler,  which 
translates  individual  specifications  into  procedural  programs:  (ii)  the  configurator,  which 
creates  distributed  and/or  parallel  computation  from  a  set  of  specifications:  and  (iii)  the 
recently  developed  timer,  which  evaluates  processing  time  (delays)  in  a  specification.  It. 
therefore,  provides  the  basic  data  for  a  computational  load  balance  analysis.  Tlx'  advan¬ 
tage  of  using  MODEL  for  load  balancing  i'  that  flow  of  control  in  the  object  program  i' 
generated  by  the  MODEL  compiler  and  this  flow  of  control  can  be  readily  used  by  tin 
timer  for  performance  evaluation.  Another  advantage  is  the  ease  with  which  MODEL 
specifications  can  be  modified.  Thank'  to  the  nonprocedural  and  static  view  of  computa¬ 
tions,  redesigning  a  system  requires  merely  moving  individual  equations  from  one 
moduli'  to  another.  As  presented  in  Le<  et  al.  I  1986).  thi'  lead'  to  a  new  method  of 
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software  development  in  the  presence  of  time  constraints. 

Software  development  with  the  MODEL  system  consists  of  the  cooperative  use  of 
the  three  above  mentioned  automatic  components,  as  illustrated  in  Figure  1. 


Figure  1.  The  cooperation  of  the  MODEL  system  components  in  developing  an 
application. 


Programming  in  MODEL  starts  with  partitioning  an  overall  computation  into  module s 
and  files.  Initially,  no  consideration  is  given  to  efficiency.  A  module's  area  is  del¬ 
ineated.  along  what  appears  to  the  user  as  the  natural  functional  boundaries.  Modules 
are  also  candidates  for  concurrent  processing.  A  module  consists  of  a  declaration  of 
array  variables  and  a  set  of  equation'  that  relate  these  arrays  to  each  other.  Such 
module  definition  in  the  MODEL  language  i'  called  a  specification.  Array  variables  are 
organized  in  a  tree  structure  which  is  denoted  as  a  file.  A  file  defined  (or  produced I  by 
one  module  may  be  referenced  lor  consumed)  by  another  module.  Files  may  be  stored 
on  external  memory  devices  |i.<\  disks,  tapes i  or  can  be  directly  communicated  between 


modules.  A  computation  is  viewed  on  a  global  level  as  a  graph  or  a  network  whose 
nodes  represent  modules  and  files  and  edges  connect  modules  to  files  that  are  consumed 
or  produced  by  those  modules. 


The  configurator  performs  the  global  level  design  and  programming.  It  accepts  a 
graph  of  the  overall  network  of  modules  and  files  (including  assignment  of  nodes  to  local 
or  distributed  hardware),  called  a  configuration,  as  input.  In  addition  to  checking  com¬ 
pleteness  and  consistency  and  producing  documentation,  it  generates  command  language 
programs  which  schedule  the  execution  of  modules,  maximize  parallelism,  and  set  up 
communication  among  modules.  The  compiler  performs  the  detailed  level  design  and 
programming.  It  accepts  as  input  a  module  specification.  In  addition  to  checking,  it 
generates  a  respective  optimized  program  (in  object,  high-level  language).  The  graph 
form  is  convenient  for  a  configuration  while  the  equation  and  array  form  is  preferable 
for  a  module.  These  two  forms  can  be  translated  to  each  other  and  a  configuration  is 
viewed  as  an  extension  of  module  specifications.  Finally,  the  timer  evaluates  worst  case 
delays  in  a  module  between  any  successive  communicating  messages.  While  little,  if 
any,  computer  skills  are  needed  for  using  the  configurator  and  compiler,  some  expertise 
is  necessary  to  use  the  timing  information  to  tune  a  system  and.  thereby,  improve  its 
performance.  These  reported  delays  are  considered  the  service,  times  of  each  module. 


The  MODEL  system  automatically  implements  communications  between  modules 
by  using  queues  (invisible  to  the  less-expert  user).  Knowledge  of  the  worst  case  queue 
length  and  service  times  allows  a  more-experienced  user  to  evaluate  delays  and 
throughput  in  critical  paths  of  the  configuration.  Balancing  the  load  on  processors  may 
be  attained  by  migration  of  equations  from  one  module  to  another  or  by  splitting  or 
fusing  modules.  Timing  information  can  also  be  used  as  a  basis  for  assigning  modules  to 


processors. 


3.  Predicate  Analysis  in  MODEL  Specifications 


In  the  MODEL  system,  a  specification  is  represented  by  an  array  graph,  where  a 
node  represents  accessing,  storing  or  evaluation  of  an  entire  array  and  the  edges 
represent  dependencies  among  array  variables.  The  underlying  graph  of  elements  of  the 
arrays  may  be  derived  from  the  array  graph  based  on  the  attributes  of  dimensionality, 
range,  and  forms  of  subscript  expressions,  which  are  given  for  each  node  and  edge  in  an 
array  graph.  A  node  A  corresponding  to  a  m  dimensional  data  or  equation  array 

represents  the  elements  from  A ( 1.1 . 1)  to  A(Nl.N2 . Nm)  where  Nl...Nm  denote  the 

ranges  of  dimensions  1  to  m.  respectively.  Similarly,  a  directed  edge  represents  all  the 
instances  of  dependencies  among  the  array  element',  of  the  nodes  at  the  ends  of  the 
edge'.  The  dependencies  imply  precedence  relationships  in  the  execution  of  the  respec¬ 
tive  implied  actions.  There  are  several  types  of  dependencies.  For  example,  a  hierarchi¬ 
cal  precedence  refers  to  the  need  to  access  a  source  structure  before  its  components  can 
be  accessed,  or  vice-versa,  the  need  to  evaluate  the  components  before  a  structure  is 
stored  away.  Data  tlepemh  ncy  precedence  refers  to  the  need  to  evaluate  the 
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independent  variables  of  an  equation  before  the  dependent  variable  can  be  evaluated. 
Similarly,  data  parameters  of  a  structure  (e.g..  defining  a  range  of  a  dimension  in  struc¬ 
ture)  must  be  evaluated  before  accessing  the  respective  structure.  Therefore,  the  array 
graphs  represents  not  only  explicit  data  dependencies  resulting  from  defining  a  variable 
in  term  of  others,  but  also  implicit  ones  implied  by  the  semantics  of  the  language,  like 
hierarchical  or  data  parameter  ones.  Labeling  dependencies  by  the  corresponding  sub¬ 
script  expressions  (that  define  mappings  between  dependent  structure  elements)  enables 
compact  representation  of  the  specification.  Array  graph  compactness  and  uniformity  of 
representation  of  different  dependencies  simplified  greatly  analysis  and  optimization 
procedures  of  the  MODEL  system.  It  is  worthwhile  to  note  that  flow  of  control  is  absent 
from  an  array  graph,  since  it  is  generated  later.  Therefore,  often  superficial,  dependen¬ 
cies  imposed  by  the  way  the  algorithm  is  written  by  the  user  do  not  inhabit  optimiza¬ 
tion  in  the  MODEL  system. 

Data  flowr  analysis  usually  ignores  predicates  associated  with  equations.  As  the 
result  the  following  specification  poses  a  problem: 

A  =  IF  X  >  0  THEN  0  ELSE  B  +  1; 

B  =  IF  X  >  0  THEN  A  +  1  ELSE  0; 

Since  here  A  depends  on  B  and  B  depends  on  A.  the  compiler  would  issue  a  circular 
logic  message  and  would  create  a  procedure  for  solving  what  appears  to  it  a  set  of  simul¬ 
taneous  equations,  defining  A  and  B.  A  slightly  deeper  analysis  shows  that  the  specifi¬ 
cation  is  well  formed,  yielding  directly  A=0.  B=1  when  XX)  and  A  =  1 .  B=0  otherwise. 

There  are  other  instances  where  checking  of  conditions  may  flag  errors  that  are  nor¬ 
mally  undetected.  Those  include  checking  that  initial  values  in  recurrent  relations  and 
exit  conditions  in  recursive  functions  are  well  defined.  Similarily.  the  timer  evaluates 
time  delays  on  every  possible  path  independently  of  conditions  under  which  any  given 
path  can  be  selected.  Therefore,  finding,  for  example,  that  predicates  of  two  equations 
are  exclusive  allows  the  timer  to  set  max(Tl,T2),  not  T1  +  T2  as  the  total  time  delay, 
where  Tl  and  T2  denote  time  delays  of  the  first  and  second  equation  respectively. 

This  suggests  introducing  condition  information  into  the  array  graph.  A  labeling  of 
edges  by  the  respective  predicate,  and  recognition  of  equivalence  or  exclusiveness  of 
predicates  that  label  edges  would  make  the  system  accept  or  reject  some  specifications. 
This  is  an  open-ended  investigation  since  no  algorithmic  procedure  could  solve  the  prob¬ 
lem  in  its  full  generality.  The  main  problem  is  of  course  to  recognize  equivalent  but  not 
necessarily  identical  predicates  as  well  as  exclusive  ones.  We  believe  that  partial 
scheme*  which  we  propose  below  may  solve  a  high  proportion  of  these  cases  and 
enhance  the  system  capability. 

Procedural  verification  techniques  are  usually  based  on  propagation  along  control 
flow  edges  ( cf.  de  Bakker  ( 1 0S0 ) .  Jones  |19>0).  Manna  (197-11.  or  Reynolds  ( 1 9 S 1 1 1 . 
Thus,  the  standard  techniques  of  computing  the  weakest  precondition  and  strongest 
post-condition  rely  on  the  propagation  of  logical  information  backward  and  forward  in 
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the  control  flow  graph.  In  an  analogous  way.  it  is  possible  to  develop  a  "proof  theory" 
corresponding  to  logical  information  propagation  across  an  array  graph.  By  propagating 
logical  information  backward  and  forward,  we  are  able  to  check  consistency  between 
separately  specified  constraints  on  disjoint  variables.  We  are  also  able  to  relate  different 
variables  used  in  predicates  to  source  data  elements.  Therefore,  we  can  rewrite  all  the 
predicates  of  a  specification  in  terms  of  a  uniform  set  of  source  variables:  i.e.  we  are  able 
to  normalize  them.  This  very  important  property  of  normalization  results  from  nonpro- 
cedurality  of  equational  specifications.  Namely,  each  variable  is  uniquely  defined  and 
cannot  be  redefined  in  a  specification,  so  it  is  not  dependent  on  the  stage  of  program 
execution. 

Predicates  in  specifications  consist  of  Boolean  terms  connected  by  Boolean  opera¬ 
tors.  Almost  exclusively,  those  Boolean  terms  are  formed  from  arithmetic  inequalities, 
typically  involving  subscript  expressions.  Thus,  to  assert  the  relations  of  different  nor¬ 
malized  predicates  to  themselves,  we  need  a  theorem  proving  subsystem  in  t lit'  domain 
of  arithmetic  inequalities  (cf.  Bundy  (1983)).  We  decided  to  use  the  Sup-Inf  decision 
procedure  for  an  area  of  real  number  arithmetic.  It  was  originally  developed  for  natural 
numbers  by  Bledsoe  (cf.  Bledsoe  (1974)).  although  it  does  not  constitute  a  decision  pro¬ 
cedure  for  the  corresponding  area  of  natural  number  arithmetic.  The  whole  of  real 
number  arithmetic  is  known  to  be  undecidable.  We  are  going  to  carve  out  a  decidable 
subpart  by  allowing  only  the  additive  functions,  i.e.  by  excluding  trigonometry, 
exponentiation,  etc.  Our  experience  with  the  MODEL  language  shows  r hat  predicates  in 
equations  are  almost  exclusively  limited  to  additive  functions. 

The  idea  of  the  Sup-Inf  method  is  to  show  that  a  negation  of  a  conjecture  is  unsa- 
tisfiable  because  one  of  its  constants  cannot  be  assigned  a  type.  The  negated  conjecture 
is  put  through  a  series  of  normal  forms,  the  last  of  which  assigns  a  type  to  each  Skolem 
constant.  The  Sup-Inf  method  works  with  the  negation  of  the  conjecture  in  disjunctive 
normal  form  and  proves  the  conjecture  if  it  is  unable  to  assign  a  type  to  one  Skolem 
constant  in  each  disjunct.  The  method  can  be  extended  to  formulae  with  free  variables 
by  adding  a  technique  for  eliminating  such  variables.  The  technique  eliminates  vari¬ 
ables  from  the  conjecture  using  interpolation  and  reduces  the  conjecture  to  a  variable 
free  formula,  which  is  then  handled  by  Sup-Inf  (cf.  Bledsoe  (1980)).  Extension  allowing 
proper  Skolem  functions  is  also  possible  and  leads  to  a  decision  procedure  for  formulae 
allowing  both  universal  and  existential  quantifiers  at  any  level.  I  nfortuiiately .  all 
known  decision  procedures  for  such  extension  (called  Presburger  Natural  Arithmetic)  are 
very  inefficient  (cf.  Cooper  (1972).  Shostak  (1979)).  Therefore  we  will  investigate  what 
types  of  inequalities  are  the  most  commonly  used  in  MODEL  conditions  and  how  to 
modify  the  Sup-Inf  method  to  include  the  proper  -ubdomam  of  arithmetic. 

4.  Conclusions 

Definitional  languages  offer  a  new  approach  to  tin  software  development  problem. 
They  relieve  user"-  from  concerns  about  efficiency  and  implementation  details  and  pro¬ 
vide  better  checking  facilities  than  procedural  languages.  They  can  therefore  lead  to  a 
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dramatic  improvement  in  software  productivity.  These  promises  are  however  contingent 
upon  developing  a  language  processor  capable  of  generating  high  quality  code  from 
definitional  specifications. 


In  this  paper  a  novel  technique  of  conditional  equation  analysis  is  proposed.  It  is 
based  on  application  of  theorem  proving  technique  for  global  predicate  analysis  in  equa- 
tional  specifications.  It  leads  to  more  accurate  estimates  of  time  delays,  deeper  specifi¬ 
cation  verification,  and  finer  identification  of  simultaneous  equations.  An  initial  imple¬ 
mentation  of  predicate  analysis  in  stack  structure  definitions  has  resulted  in  significant 
optimization  of  object  programs  generated  by  the  MODEL  translator  (cf.  Szymanski 
and  Prywes  (1986)). 
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AN  INVESTIGATION  OF  PARALLELISM 
IN  RULE  BASED  SYSTEMS 
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ABSTRACT 

For  mar.y  applications  rule-based  programming  languages  are  too  slow  We  are 
studying  the  use  of  parallel  architectures  to  increase  overall  performance  of  a  logic 
programming  interpreter.  Using  a  tightly-coupled  multiprocessor,  we  have  found  that 
substantial  performance  increases  can  be  obtained,  and  the  parallelism  inherent  in 
logic  programs  can  be  easily  exploited. 

1.0  INTRODUCTION 

Rule  based  programming  languages  are  popular  for  AI,  cognitive  modelling,  expert 
systems,  and  databases.  Production  systems,  for  example,  have  been  used  for  medical 
diagnosis  [1],  computer  configuration  [2],  and  spacecraft  payload  scheduling  [3],  Logic 
programming  languages  have  been  used  for  planning  [4),  natural  language  [51  and 
prototyping  database  systems  [6[.  DARPA's  strategic  computing  initiative  predicts  the 
existence  in  the  very  near  future  of  expert  systems  containing  30,000  rules  requiring 
the  execution  of  about  12,000  rules  per  second  as  opposed  to  50  per  second  for 
current  systems.  Performance  increases  can  be  obtained  through  the  use  of  compiler 
and  optimization  techniques(7,8,9],  faster  device  technology  ,  special  purpose 
hardware  [10,111  and  parallelism!  12,13,14).  These  techniques  are  by  themselves 
unlikely  to  produce  the  necessary  performance  increases  ,  therefore,  parallel 
execution  of  expert  systems  (15,16,171  must  be  in  combined  with  these  other 
techniques. 

Parallelism  in  rule-aased  systems  is  easily  identified.  These  systems  consist  of  a  set 
of  facts  or  assertions,  and  a  set  of  rules,  new  information  is  derived  through  the 
repeated  application  of  rules  to  facts.  At  any  moment,  more  than  one  rule  may  be 
applicable  and  these  often  define  alternative  solutions  Conventional  systems  choose 
one  of  these  rules  according  to  some  fixed  strategy  and  apply  the  rule  resulting  in 
new  information.  The  construction  of  the  set  of  alternatives  can  be  performed  by 
concurrently  matching  rules  with  facts;  this  is  called  production-level  parallelism  .  or 
rule-level  parallelism  .  The  parallel  application  of  this  set  of  rules  is  called  search  level 
parallelism  . 

For  the  remainder  of  the  paper  we  focu3  on  a  specific  type  of  rule-based  system, 
logic  programming.  Section  2  briefly  describes  logic  programs  and  the  parallelism  in 
them.  Section  3  gives  a  high  level  description  of  the  architecture  we  are  studying  , 
the  BBN  Butteifly  .  Section  4  describes  some  preliminary  studies  m  rule  and  search 
level  parallelism,  and  section  5  conclusions. 

2.0  LOGIC  PROGRAMS 

A  logic  program  consists  of  a  set  of  assertions  or  facts  d  a  set  of  rules  or  axioms 
in  horn  clause  lorm  1  IS.  19 )  Figure  1  shows  n  simple  program  describing  the 
relationships  between  scholarships  anil  students  Rule  1  may  bo  informally  read  as 
"X  is  entitled  to  a  scholarship  if  X  is  a  student.  X's  gpa  is  Y,  Y  is  greater  than  3.4, 


g  1 ve  ~  k^o  1  •*■  *Mp(  ^«)  <  *  (  7  ** )  4  9  oa(  4  9*"  e^te^f  3.  <)  4  »*"<  or  (  ?«) 
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Logic  programs  can  be  executed  using  a  particular  form  of  resolution  called 
SLD-resolution  [19],  which  we  illustrate  with  the  following  example.  A  program  is 
driven  by  a  query,  or  goal,  such  as:  "find  an  object  Z  such  that  Z  is  entitled  to  a 
scholarship."  The  solution  to  this  query  can  be  viewed  as  search  of  a  proof  tree. 
Figure  2  shows  a  proof  tree  for  this  program  and  goal.  The  goal  matches  with  rule  1 
producing  the  new  goal  find  a  student  such  that  the  student’s  gpa  i3  Y,  Y  is  greater 
than  3.4,  and  the  student  is  a  senior."  For  the  original  goal  to  be  satisfied,  this 
conjunction  of  new  goals  must  be  satisfied.  The  new  goal  matches  with  two  facts 
producing  two  new  goals,  one  of  which  results  in  a  solution  [X  =  tom). 
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Inspection  of  the  proof  tree  reveals  that  parallelism  can  be  exploited  in  several  places, 
most  notably  in  the  match  of  a  goal  to  rules  and  facts  (rule-level)  and  the  search  of 
resulting  new  goals  (search-level). 

2.1  Subrule  parallelism 

Subrule  level  parallelism,  or  intra-rule  parallelism  [16],  refers  to  parallel  unification  of 
a  goal  and  rule.  A  task  is  created  for  each  pair  of  arguments  that  must  be  matched. 
This  level  is  characterized  by  a  high  amount  of  communication  and  synchronization 
between  tasks.  Analysis  [10]  indicates  that,  on  the  average,  there  are  throe 
arguments  to  be  unified,  and  that  two  of  them  share  variables  making  software 
implementation  of  unification  unprofitable.  Further  studies  (10,11,20)  indicate  more 
efficacy  from  special  purpose  hardware  for  unification  than  implementation  on 
general  purpose  multiprocessors. 

2.1  Rule  parallelism 

Rule  level  parallelism,  or  production  level  parallelism,  refers  to  parallel  match  of  a 
goal  with  multiple  facts  or  rules  If  n  task  is  created  for  each  match,  this  lev.  l  13 
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characterized  by  no  communication  between  task3,  no  synchronization,  and 
contention  between  tasks  for  goals.  This  task  independence  makes  rule-level 
parallelism  a  good  candidate  for  parallel  implementation.  The  amount  of  parallelism 
varies  from  program  to  program  from  little  in  some  such  as  append  to  very  much  in 
database  applications  where  queries  are  satisfied  by  many  records. 

A  study  of  six  OPS5  expert  systems  f  15),  showed  an  average  of  about  35  rules  that 
could  be  matched  in  parallel.  Other  research  (101,  showed  this  to  be  about  10  for 
several  Prolog  programs.  It  is  questionable,  however,  whether  these  results 
extrapolate  to  other  expert  systems  because  the  amount  of  parallelism  depends  on 
the  application.  In  large  databases,  for  example,  particular  records  can  occur  in  the 
thousands,  and  one  might  observe  flurries  of  massive  parallelism  for  some  queries,  to 
very  little  in  others.  It  may  be  the  case  that  new  semantics  for  parallel  rule-based 
systems  will  result  in  programs  with  more  parallelism  than  existing  programs  written 
for  uniprocessors. 

An  important  issue  is  that  of  task  size.  At  this  level,  we  define  task  size  as  the 
number  of  goal-to-rule  matches  each  task  performs.  The  amount  of  work  in  a 
particular  match  varies  as  in,  for  example,  p(X)  with  p(cons(l,nil))  and  p(X,Y,f(X)) 
with  p(a,f(a),f(a)).  Decomposing  a  task  into  less  than  one  match  results  in  subrule 
parallelism.  Optimal  task  size  depends  on  features  of  the  particular  architecture  such 
as  task  creation  overhead,  frequency  and  cost  of  task  communication. 

2.2  Search  Parallelism 

Search  level  parallelism  refers  to  parallel  search  of  the  proof  tree.  Search  ievel 
parallelism  consists  of  AND  and  OR-parallelism  (14),  and  many  different 
implementations  have  been  proposed  and  studied  (12,21,22,23). 

2.2.1  OR-parallel  search 

Since  alternate  branches  represent  alternate  solutions,  they  can  be  searched  with 
little  or  no  communication  between  paths.  Variable  bindings  and  modifications  to  the 
knowledge  base  produced  by  separate  search  paths  must  be  kept  separate. 
Combinatorial  explosion  of  the  proof  tree  necessitates  scheduling  heuristics  in  order 
to  focus  on  promising  paths. 

Similar  to  rule  level  parallelism  the  amount  of  OR-parallelisra  varies  from  application 
to  application.  If  a  program  contains  many  rules  with  the  same  head,  parallelism  will 
be  high.  Even  if  only  a  single  solution  exists,  speedup  can  be  obtained  from 
OR-paraJlel  search  since  this  takes  the  place  of  the  depth-first  search  with 
backtracking  used  by  most  systems 

2.2.2  AND-parallel  Search 
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In  AND-parallel  search  conjuncts  in  a  goal  (nodes  on  a  path)  are  solved  in  parallel. 
Synchronization  and  communication  is  necessary  if  variables  are  shared  between 
conjuncts.  In  figure  2,  for  example,  GPAfjane.Yi  and  Y>3  4  must  agree  on  a  value 
for  Y.  Furthermore,  GPA(jane,Y)  should  be  solved  first  since  the  se^  of  values  it 
could  return  for  Y  is  much  smaller  than  the  set  of  values  greater  than  3.4.  Solving 
GPA(jane.Y)  first  would  thus  constrain  the  search  for  Y>3  4.  Class(senjane), 
however,  can  tie  solved  in  parallel  with  the  other  two  conjuncts  without  any  extra 
overhead 

AN'D-parallelism  is  not,  in  general,  detectable  at  compile  tune,  because  it  cannot 
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occur  when  two  conjuncts  contain  the  same  unbound  variable  or  when  they  contain 
different  variables  which  have  been  set  to  the  same  unbound  variable.  These 
bindings  occur  at  run-time  therefore  AND-parallelism  is  only  profitable  if  the  amount 
is  large  enough  to  overcome  run-time  checks  [241  A  different,  but  less  flexible 
approach,  is  to  provide  a  syntax  to  make  AND-parallelism  explicit  in  the  program 
[251. 

3.0  BBN  Butterfly 

The  Butterfly  is  a  shared  memory  tightly  coupled  multiprocessor  consisting  of  n 
nodes  connected  by  a  butterfly  (FFT)  switch.  Our  machine  has  16  nodes  each 
consisting  of  1  Mbyte  of  memory  as  well  as  a  M68000  microprocessor  (PE)  and  a 
bit-slice  co- processor  called  the  process  node  controller  (PNC).  The  PNC  handles  all 
memory  references  for  the  node. 

Each  node’s  memory  is  partitioned  into  a  globally  addressable  and  locally  addressable 
memory.  When  a  processor  allocates  memory,  it  can  allocate  to  its  private  memory, 
global  memory,  or  globally  mapped  local  memory.  In  the  absence  of  contention, 
accesses  to  local  memory  take  2  usee  verses  4  usee  for  a  remote  memory  access. 

4.0  Approach 

We  are  interested  primarily  in  rule  and  search  level  parallelism  because  we  believe 
these  to  be  the  most  promising  for  performance  increases.  This  is  due  to  the  lack  of 
communication  and  synchronization  required  between  tasks  executing  at  these  levels. 
It  13  our  goal  to  understand  how  task  granularity  and  memory  allocation  affect  the 
performance  of  an  interpreter  which  exploits  these  levels. 

4.1  Performance  of  Rule  Level  Parallelism 

Our  first  approach  to  parallel  unification  combines  uniform  distribution  of  patterns 
over  shared  memory  with  coarse  granularity  (few  tasks  performing  many  matches). 
The  set  of  patterns  is  evenly  divided  into  disjoint  subsets,  one  for  each  processor. 
Tasks  consist  of  matching  a  goal  stored  in  a  processor’s  local  memory  with  each 
pattern  in  the  processor’s  assigned  subset  (stored  in  shared,  possibly  remote, 
memory).  The  patterns  we  use  are  average  in  size  and  complexity  based  on  the 
previously  mentioned  studies  of  Prolog  programs  [10]. 

In  a  rule-based  system,  pattern  size  varies  widely  which  may  cause  end-effects  for 
this  coarse-grain  approach;  this  arises  if  a  number  of  tasks  match  more  complex 
patterns  than  other  tasks.  It  is  therefore  important  that  tasks  are  created  in  such  a 
manner  that  work  is  evenly  balanced  across  them. 

In  spite  of  equal  task  sizes,  the  results  in  figure  3  illustrate  that  coarse  granularity 
suffers  from  an  end  effect  due  to  memory  and  switch  contention. 


End-efTects  can  be  mitigated  by  using  finer  granularity  (  more  tasks  performing 
fewer  matches).  At  the  extreme,  each  task  consists  of  matching  the  goal  to  exactly 
one  rule  or  fact.  In  a  system  where  patterns  vary  greatly  in  size,  this  fine  grained 
approach  allows  for  load  Furthermore,  the  lack  of  communication  between  tasks 
suggests  a  very  fine  grain  approach,  1  match  per  task.  Reducing  task  size  further 
results  in  subrule-level  parallelism. 

As  expected  the  very  fine  grain  approach  achieves  faster  runtimes  than  the  coarse 
grain  approach  (figure  3).  There  is  a  cutoff,  however,  at  which  decreasing  task  size 
does  not  appear  to  improve  performance.  Decreasing  the  number  of  matches  per  task 
past  1  introduces  too  much  overhead. 

Another  way  of  increasing  the  throughput  of  the  unification  operation  is  to  alter  the 
allocation  of  rules  to  processors’  local  memories.  This  reduces  the  time  to  access 
memory  since  no  contention  can  occur,  and  the  operating  system  does  not  need  to 
make  the  reference  with  exclusive  access.  The  granularity  of  this  approach  is 
dictated  by  the  number  of  processors.  As  indicated  in  the  graph  in  figure  4  the  local 
allocation  scheme  is  no  faster  than  the  best  shared  memory  fine-grained  approach. 
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4.2  Performance  of  Search  Level  Parallelism 

As  a  preface  to  addressing  search  level  parallelism  we  first  must  fit  the  results  of  the 


unification  study  into  the  context  in  which  it  will  be  used.  In  the  previous 
experiment  the  goal  was  already  present  in  private  memory  at  the  time  the 
unification  procedure  began.  For  search  level  parallelism,  new  goals  (paths)  are 
constantly  created  and  must  be  accessed  by  all  tasks.  The  goal  can  be  shared  or 
copied  into  each  task’s  local  memory. 

We  performed  a  test  to  assess  the  affects  of  copying  versus  sharing.  The  results  are 
shown  in  figure  5.  The  most  noticeable  result  is  that  local  allocation  performed 
significantly  better  than  global  allocation  which  was  not  the  case  in  the  previous 
experiments.  This  is  due  to  the  added  contention  on  the  goal  in  addition  to 
contention  on  rules;  local  allocation  involves  only  goal  contention. 
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Figure  5  also  depicts  the  fact  that  regardless  of  granularity  and  allocation,  sharing 
resulted  in  better  performance  than  copying.  Due  to  the  dynamic  nature  of  goals, 
they  were  represented  as  tree  structures  which  required  traversal  for  copying. 

Storing  goals  in  contiguous  memory  can  capitalize  on  efficient  hardware  block 
transfer  mechanisms.  This,  however,  requires  bounds  on  the  size  of  these  structures. 

Our  interpreter  currently  exploits  OR-parallelism  and  a  restricted  form  of 
AND-parallelisra  called  STREAM-parallelism.  STREAM-parallelism  refers  to  pipelining 
the  bindings  between  conjuncts  so  that  the  execution  of  conjuncts  can  be  overlapped. 

A  problem  occurs  when  the  number  of  goals  (leaves  in  the  tree)  grows  much  larger 
than  the  number  of  PEs.  Breadth  first  search  can  be  performed  by  creating  tasks 
for  every  leaf  thereby  expanding  one  level  at  a  time.  This  would  be  valuable  if  few 
solutions  were  ultimately  derivable  and  the  tree  quickly  thinned  out  due  to  failed 
solution  paths.  A  more  complex  scheme  is  to  use  A*  search  based  on  a  programmer 
supplied  evaluation  function.  Another  promising  strategy  is  to  allow  the  programmers 
to  weight  structures  and  choose  the  nodes  with  the  highest  values  according  to 
these  weights.  The  weights  may  be  assigned  to  clauses,  literals,  or  symbols,  but 
incorrect  weighting  can  result  in  infinite  failure. 

Another  problem  is  the  adaption  of  extra-logical  control  features  such  as  assert  and 


cut.  The  assert  operator  allows  the  program  to  modify  the  database  and  cut  controls 
backtracking.  These  extra-logical  features  often  improve  the  performance  of  a 
sequential  interpreters,  however  their  non-logical  semantics  affects  parallel 
interpretation.  For  example,  something  asserted  by  a  clause  can  later  be  retracted  by 
the  clause.  In  the  meantime,  it  must  not  affect  other  search  paths.  The  cut 
operator  is  often  used  to  restrict  the  interpreter  to  returning  a  single  solution.  It 
prunes  the  search  tree  by  removing  predicates  whose  heads  ore  similar.  This  feature 
is  often  abused  resulting  in  "if-then"  execution  of  similar  rules.  These  rules  then 
cannot  be  executed  in  parallel  (unless  eager  evaluation  is  used,  and  the  results 
ordered). 

New  constructs  such  as  a  onc_of  meta-predicate  are  needed.  Thi3  predicate  would 
take  as  an  argument  another  predicate  and  non-deterministically  return  a  single 
solution.  Furthermore,  techniques  from  concurrent  databases  such  as  only  updating 
after  an  explicit  commit  will  be  useful. 

5.0  Conclusions 

For  a  16  node  butterfly  we  were  able  to  show  near  linear  speedup  for  rule-level 
parallelism  on  fine  grained  global  allocation  of  rules  as  well  as  local  allocation  of  rules. 
The  less  than  linear  speedup  near  the  maximum  number  of  PEs  wa3  the  result  of 
memory  and  goal  contention.  This  was  less  of  a  factor  with  local  allocation  resulting 
in  better  overall  performance  with  local  allocation.  This  local  allocation  scheme  is 
similar  to  the  message-passing  paradigm. 

Using  linear  speedup  as  a  measure  of  program  performance  is  not,  in  general,  a  good 
performance  raeasure(26).  This  occurs  when  inefficiencies  in  the  program  dwarf  the 
amount  of  internal  communication.  If  the  program  is  optimized,  the  internal 
communications  may  become  a  bigger  factor  and  preclude  linear  speedup  In 
rule-level  parallelism,  the  program  in  question  is  unifying  all  possible  matching  rules 
with  a  goal.  If  a  more  efficient  unification  procedure  were  implemented,  the  result 
would  be  a  slightly  different  optimal  task  size  than  we  observed,  but  we  would  still 
expect  the  same  results  due  to  the  independence  of  tasks. 

The  less  than  linear  speedup  near  the  maximum  number  of  PEs  was  the  result  of 
memory  and  goal  contention  for  global  allocation  and  goal  contention  for  local 
allocation.  The  upper  bound  on  the  number  of  PEs  that  can  be  used  (inefficiently 
much  of  the  time)  is  essentially  dependent  upon  the  number  of  similarly  headed 
rules  and  facts  which  we  expect  varies  widely.  Additionally,  this  involves  tradeoffs 
such  as  throughput  versus  efficient  use  of  processors.  To  achieve  maximum  benefit 
from  a  large  number  of  PEs,  however,  goal  contention  must  be  minimized. 

Combinatorial  explosion  of  the  search  space  can  be  more  of  a  problem  for  parallel 
interpreters  than  their  sequential  counterparts.  This  necessitates  techniques  for 
allowing  programmers  to  control  the  search  process  rather  them  relinquishing  control 
to  an  exhaustive  search  mechanism. 

Finally,  many  of  the  extra-logical  features  which  are  useful  for  sequential 
interpreters  create  problems  for  parallel  interpreters.  Control  constructs,  however,  are 
stili  desirable  and  necessary  for  parallel  interpretation.  An  important  area  of  furthr 
work  is  the  development  of  control  constructs  which  have  logical  semantics  and 
therefore  do  not  create  problems  for  parallel  interpreters. 
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Abstract 


We  describe  a  new  research  effort  carried  out  jointly  between  the  Software  Engineering 
Institute  and  the  Department  of  Computer  Science  at  Carnegie  Mellon  University.  The 
objective  of  the  project  is  to  investigate  languages,  methodologies,  and  tools  for 
programming  computer  systems  consisting  of  networks  of  heterogeneous  processors. 

Typical  users  of  these  notations  (and  associated  support  software)  will  be  the 
developers  of  real-time,  computation-intensive  applications  such  as  those  contemplated 
under  the  Strategic  Computing  Initiative.  In  particular,  this  research  is  being  conducted 
in  the  context  of  the  Autonomous  Land  Vehicle  application,  running  on  the 
Heterogeneous  Machine  being  developed  in  the  CS  Department. 

This  paper  provides  some  background  on  the  nature  of  the  problem  posed  by  these 
applications,  the  opportunities  presented  by  the  emergence  of  heterogeneous 
machines,  and  the  goals  of  this  project. 


Introduction 

We  are  all  familiar  with  traditional  numerical  computation  applications,  concerned  with 
the  accuracy  and  performance  of  complex  algorithms,  operating  on  simple  data 
structures  (e.g.  scalars  and  arrays)  implemented  in  some  imperative  language  (e.g. 
FORTRAN).  The  appearance  of  list-manipulation  languages  in  the  late  50’s  gave  rise  to 
symbolic  computation  applications,  concerned  with  the  manipulation  of  complex  data 
structures  (lists  and  plexes  of  different  kinds)  implemented  in  relatives  of  Lisp  and 
derivatives  of  Algol  60.  The  hardware  architectures  remained  however,  relatively 
constant  for  a  couple  of  decades  and  a  great  deal  of  progress  was  achieved  in  the 
development  of  useful  programming  languages  and  environments. 

We  are  now  beginning  to  build  networks  of  heterogeneous  processors  whose  users  are 
concerned  with  allocation  of  specialized  resources  to  tasks  of  medium  to  large  size, 
executing  concurrently1  Heterogeneous  machines  (e.g.  Figure  1)  will  have  general 
purpose  processors,  special  purpose  processors,  memory  buffers,  and  switches  which 
can  be  configured  in  more  or  less  arbitrary  logical  networks.  In  addition,  these  networks 
are  not  necessarily  static,  configured  once  and  for  all  for  a  given  application;  the 
networks  could  alter  their  configuration  depending  on  the  needs  of  the  application  at  any 
given  time. 

The  Department  of  Computer  Science  at  Carnegie  Mellon  University  is  building  a 
heterogeneous  machine  as  a  vehicle  for  research  on  high-performance  computing. 
Research  is  also  being  conducted  in  vision  processing  (e.g.  landmark  recognition)  and 
machine  reasoning  (e.g.  path  planning).  These  applications  depend  on  the  large 
amounts  of  computing  power  that  can  be  delivered  by  the  proposed  heterogeneous 
machine.  The  research  described  in  this  paper  is  being  carried  at  the  Software 
Engineering  Institute  and  addresses  the  missing  link,  namely,  the  development  of 
languages  and  methodologies  for  programming  the  heterogeneous  machine,  to  exploit 
the  coarse-grain,  task-level  concurrency  available  in  the  applications.  Exploring  this 
task-level  parallelism  is  a  new  direction  in  parallel  processing. 

The  Nature  of  the  Problem 

It  is  expected  that  users  of  a  heterogeneous  machine  will  rely  on  libraries  of 
painstakingly  developed  procedures  to  accomplish  the  common  operations  in  their 
domain  of  application.  On  a  high  performance  engine,  with  multiple  functional  units, 
pipelines,  and  register  sets,  these  procedures  can  be  very  difficult  to  write  but,  basically 
this  is  within  the  reach  of  current  compiler  technology  and  programming  these  engines 
is  not  the  showstopper. 

The  major  source  of  complexity  in  the  applications  for  which  the  new  heterogeneous 


’For  our  purposes,  let’s  assume  that  a  medium  size  task  granule  takes  in  the  order  of  100  times  a 
basic  synchronization  operation.  That  is,  we  are  not  dealing  with  the  minute  level  of  concurrency  provided 
by  array  processors  (e.g.  ILLIAC  IV)  or  pipelined  functional  units  (e.g.  CRAY),  but  rather  with  the 
scheduling  and  management  of  larger  chunks  of  computation,  with  commensurably  larger  resource 
allocation  requirements. 


machines  are  being  built  (e.g.  Autonomous  Land  Vehicle  [DARPA83])  does  not  come 
from  the  basic  data  operations  (these  are  hidden  in  the  node  procedures)  or  the  data 
structures  (usually  limited  to  arrays  and  records).  The  complexity  comes  from  the 
communication  patterns  between  the  computing  elements  required  to  make  effective 
use  of  the  available  resources. 

The  writers  of  the  application  programs  (e.g  top  of  Figure  2)  must  be  familiar  with  the 
nature  of  the  tasks  performed  by  the  processors  (nodes  in  the  graph)  and  the  contents 
of  the  data  queues  (links  in  the  graph)  in  order  to  program  the  applications.  In  general, 
the  tasks  will  take  different  amounts  of  time  to  complete  and  the  programmer  must 
schedule  the  arrival  of  sufficient  data  at  the  right  time  to  prevent  starvation  of  nodes,  but 
not  so  much  that  queues  overflow.  Thus,  an  expert’s  knowledge  of  the  application  is 
required  to  select  and  connect  the  right  resources  to  achieve  some  optimal 
performance.  Applications  might  have  additional  requirements  that  might  not  be  directly 
expressible  in  terms  of  nodes,  queues,  and  links.  For  instance,  a  requirement  might 
exist  that  some  operation  be  performed  twice  as  often  as  some  other  operation 
elsewhere  in  the  graph  in  order  to  obtain  some  balanced  flow  of  data.  These 
requirements  and  constraints  are  part  of  the  program  and  must  be  explicitly  indicated. 

The  efficient  utilization  of  a  heterogeneous  machine  requires  therefore  support  for 
developing  application  programs  organized  as  multiple,  concurrent,  cooperating  coarse- 
grain  tasks.  The  tasks  in  turn  could  be  more  tightly-coupled  parallel  programs  executing 
on  specialized  processors  such  as,  for  example,  systolic  arrays.  These  two 
programming  levels  can  be  separated  from  each  other.  The  writer  of  a  library  procedure 
that  performs  some  basic  computation  [e.g.  convolution,  histograms,  etc.]  does  not 
necessarily  know  the  context  in  which  the  procedure  will  be  used.  The  procedure 
executes  on  a  processor  that  consumes  data  from  input  queues  and  delivers  results  to 
output  queues.  By  the  same  token,  the  developer  of  the  application  does  not 
necessarily  know  the  details  of  the  procedures  running  on  the  nodes.  The  procedures 
are  treated  like  black  boxes  or  primitive  building  blocks,  with  predetermined,  perhaps 
nominal,  performance  characteristics. 

The  Nature  of  the  Solution 

Suppose  that  the  application  programs  are  represented  as  graphs,  with  nodes 
representing  the  tasks,  and  links  representing  data  communication.  Programming  the 
task  associated  with  a  node  involves  intra-node  concurrency,  while  making  the  nodes  of 
the  graph  to  work  in  parallel  is  inter-node  concurrency.  The  intra-node  concurrency 
problem  has  been  thoroughly  studied,  and  there  are  reasonably  mature  techniques  for 
writing  useful  concurrent  programs  for  intra-node  computations  on  special  purpose 
systems.  However,  the  higher  level,  inter-node  concurrency  is  not  as  mature  or 
understood;  this  is  the  area  of  interest  to  us. 

As  illustrated  in  Figure  2,  a  compiler  for  a  task-level  programming  language  will 
translate  the  application  program  into  code  for  a  virtual  machine.  The  target  ‘‘machine 
language"  will  consist  of  commands  to  be  interpreted  by  a  scheduler  node.  Typical 
commands  include  requests  for  data  movements,  data  transformations,  down-loading 


code  to  the  computation  nodes,  invoking  task,  etc.  it  is  the  job  of  the  scheduler  to 
generate  the  appropriate  low  level  control  messages  and  route  them  to  the  processors 
in  the  system. 


Ideally,  neither  the  language  nor  the  compiler  should  make  assumptions  about  the 
structure  of  the  heterogeneous  machine,  leaving  this  knowledge  to  the  scheduler.  In 
practice,  the  programmers  may  need  to  know  something  about  the  hardware,  to  perform 
application  dependent  optimizations  when  they  choose  to  do  so.  We  are  dealing 
therefore  with  (potentially)  multiple  virtual  machine  layers,  organized  in  a  hierarchy,  and 
implemented  by  networks  of  message-passing  "smart'*  resources  such  as  processors, 
queues,  switches,  etc.  The  programmer  will  develop  an  application  by  specifying  the 
operations  (i.e.,  messages)  to  be  carried  out  by  the  virtual  machine  level(s)  deemed 
optimal  for  the  application  on  hand.  The  range  of  abstractions  provided  by  the  virtual 
machines  has  to  be  available  to  the  programmer  and  this  has  obvious  implications  in 
the  language  design. 

In  this  project  we  will  address  the  following  questions:  How  much  information  about  the 
computing  engines  (nodes)  and  the  tasks  running  on  these  engines  should  be  visible? 
How  much  information  about  the  machine  structure  (in  contrast  to  the  program 
structure)  should  be  visible?  How  much  information  about  the  data  and  control 
communication  infrastructure  should  be  visible?2. 

Project  Goals 

The  main  objective  is  to  develop  a  specialized  programming  language  for  writing 
distributed  programs  with  coarse-grain  concurrency.  Suitable  constructs  will  be 
included  in  the  language  for  specifying  individual  tasks,  their  attributes,  relationships 
between  them,  and  preferred  host  processors  for  task  execution.  In  general,  language 
features  will  be  designed  in  concert  with  the  intended  users  and  the  hardware 
designers.  The  users  will  drive  the  design  of  the  features  needed  to  express  task  level 
programs.  The  designers  will  provide  information  about  the  hardware  capabilities. 
Since  the  hardware  design  is  taking  place  concurrently  with  the  design  of  the  language, 
the  former  is  likely  to  be  affected  by  the  latter  (i.e.,  language  features  needed  to  support 
the  applications  might  require  the  implementation  of  appropriate  hardware  features.) 

The  task-level,  data-flow  notation  described  above  appears  to  be  a  promising  start  in 
this  direction,  especially  for  signal  processing  computations  where  data  continuously 
flow  from  input  nodes  to  output  ones.  There  is  a  reasonably  large  body  of  literature  on 
this  subject  and  we  are  studying  a  number  of  existing  language  models.  Given  the 
nature  of  the  problem,  dataflow  languages  such  as  ID  [Arvind78a;  Arvind78b]  and  VAL 
[Ackerman79;  Dennis79]  come  immediately  to  mind.  However,  the  applications  are 


2We  distinguish  between  the  control  communication  and  the  data  communication  networks.  The 
control  communication  is  used  by  the  processors  and  other  resources  of  the  heterogeneous  machine  to 
exchange  messages  of  various  kinds,  as  they  schedule  (or  reschedule)  the  computation  tasks.  The  data 
communication  network,  on  the  other  hand,  is  used  to  implement  the  data  flow  through  the  machine,  and 
is  likely  to  have  higher  bandwidth  requirements. 
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likely  to  require  more  flexibility  than  a  pure  dataflow  model:  we  need  to  specify 
computations  on  streams  of  data  as  provided  in  Lucid  [Ashcroft77;  Wadge85],  In 
addition  to  the  data  flow  operations,  the  applications  require  the  specification  of  task 
synchronization,  task  control,  and  graph  reconfiguration  under  a  variety  of  conditions. 
To  satisfy  these  requirements  notations  such  as  path-expressions  [Campbell74a; 
Campbell  74b]  might  be  more  appropriate. 

Programs  in  the  task-level  programming  language  will  be  compiled  into  sequences  of 
task  invocation  and  data  communication  operations.  The  first  part  of  the  problem  to  be 
tackled  is  the  definition  of  the  basic  language  concepts:  the  operators  and  operands 
used  to  program  the  machines  at  the  task  level,  ignoring  the  languages  and  support 
tools  needed  to  program  the  basic  tasks  executing  in  the  computation  nodes.  The 
design  and  implementation  of  the  language  and  associated  tools  will  be  an  iterative 
process,  developing  virtual  machines  to  execute  the  task  level  programs.  The  first 
version  of  the  system  will  provide  a  simple  language,  allowing  for  direct  control  of  the 
physical  resources  (the  lowest  level  “virtual  machine”).  The  emphasis  will  be  in 
obtaining  early  feedback  from  the  users.  Later  versions  of  the  system  will  incorporate 
additional  application  requirements  (e.g.,  perhaps  better  user  interfaces.)  and  multiple 
virtual  machine  layers.  The  abstractions  provided  by  these  machines  will  allow 
optimizations  at  the  appropriate  levels  by  both  the  users  and  the  compiler. 

By  Way  of  Conclusions 

This  effort  started  in  January  1986.  In  the  interim,  we  have  been  studying  existing 
language  models  that  could  be  suitable  as  starting  points  for  the  development  of  a  task- 
level  concurrent  programming  language.  At  the  same  time,  we  have  started  the  design 
and  implementation  of  a  prototype  heterogeneous  machine  (to  be  operational  by  the 
Fall  of  1 987)  and  are  building  a  simulator  to  debug  both  the  language  and  the  hardware 
design. 
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Figure  2  --  Compilation  and  Execution  of  a  Task-level  Concurrent  Program 
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Abstract 

Multiprocessor  systems  that  allow  processors  to  access 
individual  words  in  each  others’  local  memories  can  effi¬ 
ciently  execute  a  large  class  of  parallel  algorithms.  Global 
search,  interprocess  synchronization,  and  some  forms  of 
transaction  processing  execute  several  times  more  efficiently 
in  a  shared  address-space  environment  than  in  a  system 
that  provides  only  message- passing.  Though  provided  by 
relatively  few  multiprocessors,  shared  address  spaces  can  be 
implemented  quite  easily  on  systems  that  provide  separate 
communication  and  application  processors  at  each  node. 

1.  Introduction 

Multiprocessors  are  usually  classified  as  either  tightly 
coupled,  with  main  storage  shared  via  a  common  bus  or 
multiported  memory,  or  loosely  coupled,  communicating 
via  message- passing^  In  fact,  there  exists  a  middle  ground: 
a  shared-address  space  implemented  over  the  interprocessor 
communication  network.  Shared  address-space  systems 
have  existed  at  least  since  the  inception  of  Cm*2  in  the 
middle  1970’s.  The  Shared  Memory  Hypercube4  is  a 
recent  example.  However,  relatively  few  multiprocessors 
have  taken  advantage  of  a  shared  address  space. 

2.  A  Cm*  Comparison 

The  chief  advantage  of  a  shared  address  space  lies  in 
its  ability  to  speed  up  references  to  scattered  words  in 
remote  memory.  The  interconnection  structure  treats  a 
remote  memory  reference  as,  in  effect,  a  special-purpose 
message  to  be  transmitted  to  a  destination  node  where  a 
switching  processor  retrieves  or  stores  a  designated  word  of 
memory.  However,  the  message  is  triggered  by  an  ordi¬ 
nary  read  or  write  to  a  mapped  region  of  memory,  so 
explicit  invocation  of  a  send  operation  is  not  required.  On 
Cm*,  remote  memory  references  require  from  9  to  35 
(isec.,  vs.  an  average  instruction  time  of  8  psec.  By  con¬ 
trast,  a  message- passing  implementation  would  require 
these  steps  to  read  a  single  word: 

•  The  requesting  processor  prepares  a  message  and 
invokes  a  send  operation. 

•  The  message  is  sent  over  the  communication  network 
and  stored  in  a  buffer  at  the  receiving  end. 
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•  A  server  process  at  the  receiving  end  notices  the  mes¬ 
sage  and  responds  to  it  by  preparing  and  sending  a 
return  message. 

•  The  return  message  traverses  the  interconnection  net¬ 
work  and  is  stored  in  the  requesting  process’s  message 
buffer. 

•  The  requesting  processor  performs  a  receive  operation 
and  reads  the  desired  word. 

There  are  many  sources  of  inefficiency  in  this  interaction. 
To  begin  with,  two  messages  have  to  be  explicitly  sent, 
requiring  queue  manipulation  of  the  message  buffers  at 
both  ends.  On  Cm*,  sends  and  receives  cost  at  least  110 
(isec.  each,  or  approximately  14  times  as  much  as  the  aver¬ 
age  instruction.  Sending  and  receiving  two  messages  thus 
consumes  56  instruction  times.1  In  addition,  several 
more  instructions  must  be  executed  to  write  the  parameters 
for  the  messages  and  interpret  their  contents.  However, 
this  assumes  that  the  server  process  is  running  at  the  desti¬ 
nation  end  when  the  message  arrives.  If  not,  the  running 
process  must  be  pre-empted  and  the  server  resumed.  Con¬ 
text  swaps  are  notoriously  expensive,  costing  more  than  100 
instruction  times  on  Cm”s  StarOS  operating  system.  Not 
only  is  the  requesting  process  delayed,  but  the  process  run¬ 
ning  at  the  destination  node  is  suspended  for  slightly  more 
than  two  context-swap  times. 

One  can  conclude,  then,  that  on  Cm*  remote-memory 
references  are  about  two  orders  of  magnitude  more  effi¬ 
cient  than  accomplishing  the  same  purpose  by  exchanging 
messages.  This  suggests  that  global  search  algorithms  that 
reference  a  large  distributed  database  would  run  much 
more  quickly  on  a  shared  address-space  multiprocessor. 
But  there  are  other  examples  of  algorithms  that  benefit 
from  a  shared  address  space. 

3.  Experimental  Evidence 

Jones  and  Ardo  5,2  compared  several  different  imple¬ 
mentations  of  synchronization  primitives  for  Ada  rendez¬ 
vous.  Their  benchmark  made  use  of  the  "server”  para¬ 
digm.  with  a  single  master  process  assigning  "tasks  to  a  set 
of  identical  server  processes.  Each  server  process  ran  on  a 
dedicated  processor.  Each  time  a  server  process  finished  a 
task,  it  was  assigned  a  new  task  by  the  master,  until  there 
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were  no  more  tasks  to  do.  This  paradigm  is  typical  of  a 
number  of  parallel  algorithms,  including  partial  differential 
equations,  molecular-motion  calculations,  and  a  variety  of 
matrix  algorithms. 

In  the  message- passing  implementation,  the  master 
sends  a  message  to  the  workers  describing  each  task,  and 
the  workers  respond  by  sending  a  message  back  each  time 
they  finish  a  task.  In  the  busy-waiting  implementation,  a 
worker  waits  on  a  shared  variable  until  the  master  assigns  it 
work,  and  the  master  polls  the  shared  variables  until  it 
finds  a  worker  that  needs  more  work.  A  third  version,  in¬ 
line  embedded  code,  has  the  workers  performing  procedure 
calls  to  indivisibiy  increment  a  counter  that  indexes  into  a 
table  of  task  descriptions  that  are  stored  in  a  shared  table. 
No  master  process  is  needed  in  this  implementation.  The 
latter  two  implementations  require  a  shared  address  space. 

When  processing  time  per  task  was  negligible,  the 
busy-waiting  approach  proved  to  be  about  seven  times  as 
efficient  as  the  message-passing  version.  The  in-line  code 
implementation  has  efficiency  comparable  to  the  busy¬ 
waiting  approach  for  2  processors,  but  contention  causes  it 
to  degrade  as  the  number  of  processors  is  increased. 
Another  experiment  measured  performance  when  process¬ 
ing  time  per  task  varied  randomly  between  0  and  100  msec, 
(or  0  to  80  message-interaction  times).  In  this  case,  in-line 
embedded  code  was  twice  as  fast  as  the  other  two 
approaches  for  small  numbers  of  processors,  but  its  advan¬ 
tage  nearly  vanished  when  more  than  12  processors  were 
used.  One  can  conclude  that  a  shared  address  space  was 
beneficial  when  interprocess  interactions  were  frequent,  or 
when  the  number  of  processors  was  small. 

Sindhu6  investigated  the  problem  of  reliability  in  mul¬ 
tiprocessor  operating  systems.  One  requirement  was  is  that 
the  actions  of  operating-system  utilities  be  restanable  so 
that  an  abort  in  the  middle  of  a  utility  operation  does  not 
damage  system  integrity.  The  problem  is  similar  to  that  of 
nested  atomic  transactions  in  a  database,  but  there  are 


some  differences.  An  ordinary  two-phase  commit  protocol 
is  too  inefficient  because  several  data  structures  need  to  be 
locked  in  the  course  of  a  nested  utility  call.  The  locks  may 
not  be  released  one  by  one  as  their  data  structures  are  suc¬ 
cessfully  updated;  instead  they  must  be  held  until  the  entire 
utility  call  completes  successfully.  Otherwise  an  abort 
could  cause  partial  changes  from  a  utility  operation  to 
become  visible  to  other  operations.  A  full  description  of 
the  solution  is  beyond  the  scope  of  this  paper,  but  it 
involves  making  write  references  to  remote  memory  in  the 
course  of  a  commit.  The  of  sending  messages  instead 
would  increase  the  overhead  of  restartability — already 
estimated  at  30%— to  an  unacceptable  amount. 

4.  Implementation 

The  implementation  of  a  shared  address  space  is  not 
particularly  difficult,  assuming  that  the  distributed  switch¬ 
ing  structure  is  sufficiently  intelligent.  For  example.  Cm* 
used  mapping  processors  known  as  Kmaps  to  map  refer¬ 
ences  by  one  processor  to  the  local  memory  of  another. 
Cm*  had  one  mapping  processor  per  "cluster”  of  10-14  pro¬ 
cessors.  Several  newer  multiprocessors  such  as  the  Ametek 
System  14  and  North  Carolina  State's  B-hive  system1  have 
a  separate  communication  processor  (Intel  80186)  associ¬ 
ated  with  each  main,  or  "application"  processor.  The 
memory-management  unit  of  the  application  processor  can 
be  set  to  map  a  certain  portion  of  the  virtual  address  space 
to  the  communication-processor  "device"  (Figure  1).  This 
virtual  address  consists  of  three  fields,  one  which  specifies 
the  destination  processor,  another  which  specifies  an  object 
number  at  the  destination,  and  a  third  which  gives  an 
offset  into  the  object.  A  directory  at  the  destination  node's 
communication  processor  supplies  the  base  address  of  the 
object  in  remote  memory.  The  communication  processor 
then  references  memory  at  the  remote  node  and  passes 
back  the  data  (in  case  of  a  read)  or  an  acknowledgment  (in 
case  of  a  write).  The  entire  sequence  is  carried  out 
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Figure  1 :  Mapping  an  Address  to  Remote  Memory 


without  any  explicit  system  calls  (such  as  sends)  invoked  by 
either  processor. 

5.  Conclusion 

For  multiprocessors  built  from  nodes  that  each  have 
their  own  private  memories,  the  ability  for  one  node  to 
address  another's  memory  can  speed  up  interprocessor 
interactions  by  up  to  two  orders  of  magnitude.  Several 
algorithms  which  require  fine-grained  interprocess  com¬ 
munication  can  profit  from  this  feature.  The  implementa¬ 
tion  of  a  shared  address  space  is  easy  enough  to  warrant 
serious  consideration  for  any  multiprocessor  with  an  intelli¬ 
gent  communication  network. 
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1  Introduction 

Massively  parallel  computers  based  on  hypercube  architectures  offer  an  alternative  to  traditional 
supercomputers  at  much  less  cost.  Hypercubes  have  been  discussed  in  the  literature  for  several 
decades.  As  early  as  the  mid-1970’s  a  256  processor  machine  was  announced  by  IMS  Associates. 
The  processors  were  Intel  8080’s.  In  1983  a  working  hypercube,  the  64-node  Cosmic  Cube,  was 
demonstrated  at  Caltech  [l].  A  hypercube  of  degree  n  has  N  —  2n  nodes.  The  attractiveness  of 
the  hypercube  over  other  geometries  can  be  attributed  to:  1)  the  slow  (logarithmic)  growth  in 
the  worst-case  intemode  distance  with  N;  2)  the  slow  (also  logarithmic)  growth  in  the  number 
of  connections  to  adjacent  nodes  with  N;  3)  the  recursive  structure  of  hypercubes — allows 
multiple  users  to  have  disjoint  subcubes;  4)  the  similarity  of  nodes — there  are  no  special  edge 
nodes  as  with  arrays,  for  example;  and  5)  the  ease  with  which  trees  and  meshes  of  all  dimensions 
can  be  embedded. 

This  paper  will  summeriie  the  developments  that  one  can  expect  given  the  expected  course 
of  technology  in  the  near  future  and  make  the  point  that  software  is  the  major  obstacle  to  the 
widespread  use  of  hypercube  machines. 

2  The  Present 

A  number  of  other  machines  have  been,  or  are  being,  developed  at  Caltech  [2,3].  Influenced  by 
this,  Intel  developed  the  128-node  iPSC  (personal  supercomputer)  based  on  80286/287  nodes. 
It  was  the  first  production  machine  and  was  introduced  in  July  1985.  A  similar  machine,  the 
Ametek  System/ 14  followed. 

Currently  there  are  commercial  hypercube  machines  with  thousands  of  processors.  The 
Connection  Machine  from  Thinking  Machines  has  64 K  processors  arranged  in  a  degree  12 
hypercube  with  each  node  containing  16  processors  [4].  The  processors  are  fairly  simple  and 
operate  in  SIMD  mode.  The  NCUBE/ten  has  1024  32  bit  processors  that  operate  in  MIMD 
mode.  They  have  a  combined  potential  performance  of  500  MFLOPS  (million  floating  point 
operations  per  second),  although  this  potential  is  rarely  approached  except  on  special  problems. 
A  few  thousand  processors  as  complex  as  state-of-the-art  microprocessors  is  the  limit  that 
current  technology  can  support  for  modest  cost  air-cooled  systems. 


The  University  of  Michigan  has  been  a  beta  site  for  an  NCUBE/ten  for  the  past  6  months 
and  we  have  had  an  opportunity  to  reconstruct  the  decisions  that  went  into  its  creation  and  to 
perform  a  preliminary  evaluation  of  the  resulting  machine  [5].  The  design  process  was  begun  in 
early  ’83,  and  reflects  the  state-of-the-art  at  that  time.  Dominant  considerations  were  packaging 
constraints,  memory  integration  levels,  and  custom  vtsi  integration  levels.  The  result  was  a  node 
made  of  only  seven  components:  a  custom  chip  containing  the  memory  interface,  intemode  and 
I/O  channels,  and  the  cpu;  and  6  memory  chips.  The  node  has  a  footprint  no  larger  them  a 
playing  card.  There  are  128K  bytes  of  memory  per  node,  and  a  10M  Hz  node  can  run  1153 
Fortran  Dhrystones  per  second  (v.  519  for  a  VAX  11/780)  and  445  kilo- Whetstones  per  second 
(v.  395  for  a  VAX  11/780).  The  chip  supports  full  IEEE  754  floating-point  standard  arithmetic. 


3  The  Future 

3.1  Hardware 

The  ability  to  build  reliable  inexpensive  massively  parallel  machines  has  been  demonstrated. 
As  a  matter  of  course  we  can  expect  faster  processors  and  larger  memories.  The  recent  an¬ 
nouncement  of  Intel’s  iPSC-VX,  which  includes  a  high  performance  vector  processing  capability 
at  each  node,  underscores  this  view.  Furthermore,  improvements  (more  speed  and  higher  den¬ 
sities)  in  designs  such  as  the  NCUBE/ten  can  be  expected  to  keep  pace  with  those  in  memory 
chips  since  each  node  is  dominated  by  memory. 

3.2  Software 

Software  is  a  major  obstacle.  The  recent  report  on  the  Supercomputer  Research  Center  con¬ 
cludes  that  the  absence  of  appropriate  parallel  programming  languages  and  software  tools  is  the 
single  biggest  impediment  to  the  successful  use  of  parallel  machines  [6].  Parallel  languages  are 
still  in  their  infancy,  and  critical  development  tools  such  as  parallel  debuggers  are  non-existent. 
This  poor  support  coupled  with  little  understanding  of  how  to  express  problems  as  parallel 
algorithms  may  limit  hypercube  machines  to  special  applications.  Overcoming  this  obstacle 
will  be  a  major  challenge. 
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Introduction 


Currently,  there  are  many  issues  facing  designers  of  parallel  computing  sys¬ 
tems.  The  development  of  an  architecture  independent  programming  environment  and 
selection  of  an  appropriate  implementation  vehicle  are  considered.  The  main  objec¬ 
tives  of  the  programming  environment  should  include  maximal  exploitation  of  con¬ 
currency,  freedom  from  implementation  constraints ,  transportability/ reusability  of 
programs  and  natural  interfacing  of  the  user  with  the  programming  language.  The 
concerns  of  implementing  highly  parallel  systems  include  the  selection  of  an  inter¬ 
connection  network  and  processing  element  (PE)  to  be  used  in  such  a  network, 
development  of  the  PE's  instruction  set  and  fault  tolerance  in  both  the  interconnec¬ 
tion  network  and  PE.  While  programs  should  remain  disjoint  from  the  implementation 
issues,  proper  selection  of  implementation  details  will  greatly  affect  the  efficiency  of 
the  execution  of  such  programs. 

The  basis  of  this  work  can  be  found  in  [1]  which  outlines  a  feasible  implemen¬ 
tation  of  a  general-purpose  SIMD  parallel  computer  of  512  PEs,  capable  of  operations 
on  problem  sizes  of  up  to  32768  data  points.  The  processors  are  connected  with  the 
Cube  Connected  Cycles  (CCC)  network  [6].  The  salient  features  of  the  CCC  net¬ 
work  are:  1)  interprocessor  connections  are  strictly  limited  to  three;  2)  highly  reg¬ 
ular  network  control;  3)  identical  PEs;  4)  a  high  degree  of  pipelining  can  be 
achieved;  5)  efficient  emulation  of  other  networks.  The  computer  is  intended  for 
general-purpose,  scientific  calculations  and  it  is  estimated  that  the  computing  power 
falls  within  the  range  of  500  to  1000  MIPs.  The  reader  is  referred  to  [6]  for  the 
CCC  network's  details  and  to  [1]  for  the  implementation  details. 

2.  Programming  by  Logical  Networks 

We  propose  a  parallel  programming  environment  for  SIMD  computers  based  on 
the  specification  of  an  arbitrary,  or  logical,  network  of  processors.  Specification 
includes  a  topological  description  of  the  network,  description  of  the  network's  PE, 
assigning  I/O  to  a  set  of  PEs  and  defining  an  algorithm  for  the  PE.  A  finite  set  of 
inter-node  and  intra-node  instructions  similar  in  syntax  to  Concurrent  Pascal 
defines  the  algorithm  [1].  Translation  to  an  instruction  sequence  executable  on  the 
physical  machine  is  performed  automatically  by  a  compiler,  which  determines  the 
parallelism  inherent  in  the  problem.  The  underlying  network  should  be  a  general- 
purpose  SIMD  computer,  capable  of  emulating  any  other  network  with  a  reasonable 
time  penalty.  The  Cube  Connected  Cycles  network  has  been  shown  to  be  such  a 
network  [3] .  Emulation  of  logical  networks  whose  size  differs  from  the  underlying 
network  will  also  be  possible.  Programs  will  be  portable  between  machines  since  the 
writing  and  debugging  of  such  programs  is  independent  of  the  machine  executing 
the  program.  Well  known  structures  suitable  for  a  particular  computation  (e.g., 
mesh  for  matrix  multiplication)  can  also  be  used.  By  allowing  the  definition  of  an 
arbitrary  network,  details  of  the  physical  network  remain  hidden,  eliminating  the 
need  for  low-level  information.  To  further  aid  the  user  friendliness,  a  graphical 
interface  will  be  provided  for  entering  the  network  s  connection  pattern  and  pro¬ 
gram.  Such  networks  include  the  tree,  mesh,  mesh  of  trees,  hypercube  and  cube 
connected  cycles  networks. 
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3.  The  Processing  Element  (PE)  for  the  CCC 


A  normal  data  path  cycle  on  a  conventional  processor  consists  of  the  following 
phases:  1)  Operand  Fetch;  2)  ALU  Operation;  3)  Result  Store.  We  denote  the 
aggregate  of  these  phases  as  a  major  cycle  and  each  phase  as  a  minor  cycle.  It  is 
our  intention  to  have  a  processor  that  is  as  simple  as  possible.  We  claim  that  pro¬ 
viding  instructions  as  major  cycles  is  not  consistent  with  this  goal  and  propose  pro¬ 
viding  instructions  as  minor  cycles.  Such  instructions  will  require  little  decoding  or 
control  circuitry  while  at  the  same  time  yielding  great  flexibility  in  controlling  the 
network.  This  approach  extends  the  RISC  concept  presented  in  [4].  The  pro¬ 
posed  instruction  set  is  shown  in  figure  1  and  its  associated  architecture  is  shown 
in  figure  2.  The  PE  consists  of  a  register  file  (RF),  an  arithmetic  logic  unit  (ALU), 
an  address  register,  a  mask  register,  instruction  queue  (IQ),  I/O  Buffers,  a  simple 
instruction  decoder  and  has  a  data  path  bandwidth  of  16  bits. 

In  our  implementation,  512  PEs  will  be  connected  in  a  CCC  network  using 
wafer  scale  integration  (WSI)  technology.  A  floor  plan  of  the  PE  die  is  shown  in 
figure  3.  Table  1  gives  the  initial  area  estimates  for  each  section  of  the  PE's  cir¬ 
cuits  for  a  scalable  CMOS  technology  with  a  X  =  1.0um  (i.e.,  2pm  drawn  gate 
length).  The  PE  die  size  will  be  approximately  2.42mm  X  1.53mm.  Assuming  that  a  6 
inch  diameter  (15  cm)  wafer  is  used,  there  will  be  only  10cm  X  10cm  of  usable  area. 
Given  that  the  inter-PE  and  global  routing  consumes  50%  of  the  area,  there  will  be 
1346  possible  functioning  die  sites.  With  a  yield  of  50%,  approximately  670  function¬ 
ing  dice  (more  than  the  minimum  requirement  of  512)  will  be  available  after  fabri¬ 
cation  . 

4.  Implementation  of  Highly  Parallel  Computers 

When  implementing  a  large  parallel  computer,  the  designer  must  remain  cog¬ 
nizant  of  how  constraints  such  as  physical  size  (i.e.,  layout  area),  PE  node 
degree,  instruction  broadcasting,  interprocessor  communications  delay  and  I/O  con¬ 
straints  behave  as  a  function  of  the  number  of  PEs.  The  CCC  network  performs 
well  in  all  of  these  areas.  In  [1]  it  is  shown  that  instruction  distribution  can  be 
pipelined  such  that  only  O(log  n)  PEs  need  to  be  provided  with  an  instruction  at 
each  execution  step,  while  the  I/O  latency  is  shown  in  [6]  to  also  be  0(log  n). 

The  physical  size  and  communication  delays  can  be  minimized  through  the  use 
of  Wafer  Scale  Integration  (WSI).  Large  digital  circuits  can  be  fabricated  through 
WSI  technology  to  yield  a  complete  component  encompassing  an  entire  silicon  wafer. 
The  silicon  wafer  is  packaged  intact  and  can  be  used  as  a  component  in  conven¬ 
tional  (i.e.,  board  level)  design  methods.  In  WSI,  the  PEs  form  the  die  sites  on  the 
wafer.  The  fabrication  of  these  die  sites  is  identical  to  that  of  current  VLSI  fabri¬ 
cation  technology.  After  fabrication,  the  devices  are  tested  and  functioning  dice  are 
connected  in  place.  The  physical  size  of  the  complete  system  is  therefore  reduced  to 
the  size  of  the  wafer.  Communication  delays  are  reduced  since  the  longest  path 
between  processors  is  at  most  the  wafer’s  diameter.  The  delays  associated  with 
driving  signals  off-chip  as  in  board  level  design  are  reduced  to  the  order  of  the 
wire  delay  within  a  chip.  An  effective  method  must  exist  to  interconnect  the  func¬ 
tioning  devices  on  the  wafer  if  WSI  is  to  be  considered  a  viable  technology  [5].  A 
discretionary  wiring  technique  [2]  and  a  laser  link  technology  [7]  are  being 
evaluated  for  this  project. 
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Abstract 

This  paper  presents  a  mathematical  model  of  reliability  for 
tree-structured  networks  of  processors.  It  assumes  that  processing 
elements  have  an  equal  probability  of  failure,  and  that  interconnec¬ 
tion  links  also  have  a  fixed  failure  probability.  From  these  two  pro¬ 
babilities  and  the  number  of  nodes  in  the  network,  a  general  formula 
is  derived  for  the  average  number  of  nodes  remaining  connected.  It 
is  shown  that  if  an  operational  tree  of  a  certain  size  is  desired,  it  is 
usually  more  efficacious  to  improve  the  reliability  of  individual  pro¬ 
cessing  elements  and  links  than  to  add  more  nodes  to  the  network. 
Indeed,  to  lake  best  advantage  of  a  large  network,  the  reliability  of 
the  individual  links  must  be  higher  than  in  smaller  networks.  Tht 
reliability  of  a  randomly  structured  network  is  then  compared  with 
that  of  an  n-ary  tree  network.  The  n-ary  tree  is  shown  to  be 
significantly  more  reliable  than  the  average  tree. 


I.  Introduction 

Interest  in  distributed  processing  has  grown  in  step  with  the 
declining  cost  of  processing  elements  and  advances  in  networking 
techniques.  These  developments  have  lent  added  importance  to  the 
issue  of  reliability,  because  when  processing  is  distributed  among 
many  nodes,  a  failure  in  any  processor  or  communication  link  can 
impede  the  progress  of  the  entire  system.  For  example,  a  process  at 
one  node  may  remotely  call  a  procedure  at  another  node  to  perform 
some  service.  The  called  procedure  may.  in  turn,  make  use  of  files  or 
peripherals  to  which  it  has  exclusive  access,  or  it  may  call  further 
procedures  remotely.  Distributed  program  reliability  1  is  a  measure 
of  the  dependability  cf  such  a  system.  Several  other  reliability  meas¬ 
ures  have  also  been  developed  1  -  6  . 

In  this  paper  we  define  a  very  simple  measure  of  reliability  for 
Unstructured  networks  in  terms  of  the  failure  probabilities  of  nodes 
and  links.  By  studying  the  entire  class  of  tree  structures  and  deriv¬ 
ing  statistical  averages  for  their  reliability,  we  gain  a  quantitative 
insight  into  a  wide  variety  of  interconnection  schemes,  and  discover 
how  the  reliability  is  affected  by  changes  in  network  site  Since  other 
networks  are  simply  trees  with  added  communication  links,  this 
study  also  provides  a  lower  bound  for  the  connectedness  of  general 
networks. 

Consider  a  distributed  system.  Its  processing  elements  are 
prone  to  failure  or  malfunctioning.  The  communication  links  may 
also  break  down,  either  because  of  hardware  faults  or  software 
errors.  When  processors  or  links  fail,  the  network  may  split  into 
more  than  one  subnetwork,  with  nodes  in  different  subnetworks 
unable  to  communicate. 

If  we  select  a  node  in  the  network,  the  more  nodes  that  remain 
connected  to  it.  the  more  nodes  il  can  communicate  with.  and.  hence, 
the  more  reliable  is  the  network  topology  .Assume  that  we  choose  a 
node  randomly  from  the  nodes  in  a  tree-structured  network  such  that 
each  node  is  equally  likely  to  be  selected.  Then  we  can  calculate  the 
average  number  of  nodes  in  the  still-connected  subtree  containing  the 


chosen  node.  We  will  study  how  this  average  is  affected  by  the  pro¬ 
babilities  of  node  and  link  failures  and  the  number  of  nodes  in  the 
original  tree.  We  will  demonstrate  that  the  size  of  the  connected 
subtrees  is  not  very  sensitive  to  the  probability  of  failure,  as  long  as 
the  probability  of  failure  is  small.  We  will  also  show  that  failure  of 
nodes  and  links  affect  a  tree  more  severely  as  the  tree  grows  larger; 
for  example,  for  any  given  probability  of  failure,  a  program  is  more 
likely  to  be  able  to  use  20^  of  the  nodes  in  a  small  tree  than  20^  of 
the  nodes  in  a  large  tree.  Finally,  we  will  show  that  it  is  desirable  to 
structure  a  tree  network  as  a  regular  ”n-ary  tree,  because  its  reliabil¬ 
ity  is  much  higher  than  that  of  the  “average"  tree  structure. 


2-  Connectedness  after  Failures 

Assume  that  the  processing  elements  of  a  mult  iprocesv'r  r.T 
computer  network  are  represented  by  the  nodes  of  a  graph,  an  3  that 
the  edges  correspond  to  interprocessor  links  Then  the  failur*  >.f  a 
processor  or  link  is  equivalent  to  the  removal  of  the  associated 
or  edge  from  the  graph. 

Neville  7  has  shown  that  labeled  trees  with  n  node*,  are 
one-to-one  rorrespondence  with  |r  Ji-tuples  of  symbols  rh  •«*r.  '»  « 
the  set  jl,  2,  .  n In  other  words,  the  structure  of  any  particu.v 
labeled  tree  of  n  nodes  can  be  encoded  by  a  particular  ln-2l-rjp> 
If  a  tree  7\,  is  chosen  from  this  set.  and  some  nodes  and  edc*‘  ar* 
removed  (by  failure,  for  example),  the  resulting  graph  is.  in  g'nera 
a  forest  of  disjoint  subtrees  Let  r  (5)  represent  the  number  f 
nodes  m  a  particular  subtree  S.  If  we  include  the  null  subtrees  r  > r: - 
sitting  of  removed  nodes,  then  *r  ($>  may  take  on  values  0,  1. 

.  n.  As  a  6rst  step  in  the  analysis,  we  will  find  the  average  number 
of  nodes  in  such  a  subtree  under  the  following  three  assumptions 
II)  Each  tree  Tn  in  the  set  of  labeled  trees  with  n  nodes  is  equa  .y 
probable.  Every  tree  has.  therefore,  probability  n  **  f 
being  chosen. 

(2)  A  particular  node  c  is  chosen  from  the  tree  in  such  a  »j» 
that  any  node  is  equally  likely  to  be  chosen.  Every  nod'  ha' 
therefore,  probability  n  of  being  chosen.  After  edges  or 
nodes  are  removed  by  failure,  we  let  5,  denote  the  particular 
subtree  (in  the  resulting  forest)  that  contains  nod'  -  If  n  de 
itself  has  been  removed,  then  S.  is  defined  to  be  the  null  tree 

(3)  Nodes  and  edges  are  removed  from  r„  such  that  any  node  lor. 
vertex)  haj  probability  P9  of  being  removed  and  probability 
(l-F.)  of  not  being  removed  Similarly,  any  edge  has  proba¬ 
bility  P,  of  being  removed  and  probability  M  P,  i  .  f  no  b'  ng 
removed.  Removal  of  a  node  or  edge  is  independent  of  the 
removal  of  other  nodes  and  edges  from  Tm ,  and  of  the  choice  of 
node  c.  We  will  use  the  term  ’  failure'  as  a  synonym  f  r 
“removal." 

We  will  use  these  assumptions  to  derive.  fir*t  of  ail.  an  eirre*M  -n  f.-r 
t  in  j  [  S,  )  ,  the  average  number  of  nodes  in  the  subtree  - 

Define  a  random  variable  »r  *  \  >  equal  to  the  number  f  fared 


4«  |*. 


m& 


edges  in  the  original  tree  f,  that  connectea  nodes  of  5,  to  nodes  not 
in  St.  Let  it,*,  n.9,  ,  n„T  (i'  J*  represent  the  nodes  not  in  S,  that 

were  ndjncent  to  some  node  in  S4  before  the  remove]  of  nodes  end  or 
edges  from  T,. 

Let  tu  form  e  new  tree  T.'fS.I  by  teking  the  originel  tree  T. 
end  “eollepsing”  ell  of  the  nodes  in  5,  into  e  single  "giant'  node. 
More  formally,  f^'IS.)  consists  of  the  n  -»lir^(5,  1  nodes  not  in  5,  end 
another  node  (the  giant  node)  of  degree  8r^(5, ).  r„'(S, )  is  formed 

by 

•  removing  from  f„  the  nodes  end  edges  of  S, .  together  with  the 

9r  (5,)  edges  between  nodes  n,  *,  <  =  1,2,  ,»t_(S,|  »nd  some 

node  in  5, .  end  then 

•  reinserting  these  »r_(S.)  edges  between  nodes  «,  *  end  the  gient 
node. 

(.'sing  the  above  definitions  we  will  show  that 

£*r(*.l  =  Vf>rl*T  (S.)  =  I!  I  (1) 


P»l*r  IS.)  -  f|  » 


h-p.i1  (i-p.i1  1  (  ip. -p. -p.p.) 


,'ii-p. Hi  f. i 1 


when  I  -  1,2,  ,  « 


The  total  number  of  trees  7"*  j 5. )  such  that  di(5, )  =  I  and 
•(S. )  -  )  (i.e.,  whose  giant  node  was  formed  by  collapsing  I  nodes 
and  has  j  edges  to  the  rest  of  the  tree)  is  equal  to  the  total  number 
of  trees  teith  n-l-1  nodes  when  the  degree  of  a  node  is  specified  to 
be  ;  In  Neville  s  codification,  the  node  corresponding  to  the  giant 
node  of  *  r.|S,  I  must  occur  ;-l  times  in  the  n-l-l-luple  that 
represents  T «  IS.)  These  >-)  entries  in  the  n-f-l-tuple  can  be 

In I  | 

chosen  in  I  ,  _  ,  ways  Each  of  the  remaining  entries  in 

the  n  -  I  -  l-tup’*  may  be  chosen  in  n  -  /  ways,  as  there  are  n  -  I  other 

I  1  -  I  -  I  ) 

nodes  Thus  there  are  |  j  (n-/)”'1''  distinct  labeled  trees 

T,  IS.  I  such  that  *(S.  I  «  I  and  »|S, )  =  j.  Since  the  j  edges  that 
join  the  giant  node  to  nodes  n,',  n,V  ,  n,  9  may  be  connected  to 
any  of  the  I  nodes  in  S,  (any  of  the  /  nodes  that  "make  up"  the 
giant  a  ode),  the  number  of  trees  T »  which  contain  a  tree  7"„  (S, ) 
such  that  *(5, )  *  I  and  0(5, )  *  j  is  equal  to 

I”  ,  '  I  In-IP'"'  I’ 


The  node*  in  5r  can  be  chosen  in 


ways,  because  when 


oar  node  (node  r)  is  specified  to  be  in  St ,  the  remaining  /  -I  nodes  in 
5,  may  be  chosen  from  the  remaining  n-1  nodes  in  7**.  Since  St 
itself  can  be  formed  in  ‘  ways,  there  are 

I"- 1‘  I  h-i  1  " 

trees  f*  such  that  *|5.  )  -  /  and  »(5r)  =  This  is  the  number  of 
trees  that  contain  a  particular  giant  node. 

We  are  looking  for  the  probability  that  th#*  .  n^d***  m  .' 
remain  connected,  independent  of  any  failures  in  the  r»*t  of  th' 

Since  we  assume  that  nodes  and  edges  ar*  r»*m^v»d  md-p*nd-’.t!>  f 
each  other  and  of  other  nodes  and  edges  in  t h **  tr**e,  ih*»  i-r-.babil.i v 
that  the  1  nodes  and  /-|  *dg*<  did  not  fail  is  b» 

n -p,y  \i-p4\  .  No*. 


Pria  particular  edge  or  node  or  both  is  removed  -  P„  -  Pt  -  P,P, 

Therefore,  the  probability  that  there  is  a  subtree  of  size  .  whose  con¬ 
nection  to  the  rest  of  the  tree  has  been  removed  by  the  failure  of  ; 
edges  or  attached  nodes  can  be  calculated  from 

•  the  number  of  «-/*i-node  trees  containing  a  particular  giant 
node, 

•  divided  by  the  number  of  n-node  trees, 

•  times  the  probability  that  none  of  the  nodes  or  edges  within 
the  giant  node  have  failed,  but  that  all  ;  connections  between 
the  giant  node  and  the  rest  of  the  tree  have  been  removed  due 
to  the  failure  of  nodes  or  edges. 

Thus,  for  ;-l,  2,  ,  n-/, 

PH* T  IS<)  =  /}  * 


1  n-1  U-t- I 

(1  -  (1  -P,)‘"  | P .  -  P.  -  P.P.y  when  1=1,  ,  n 

0  otherwise 

Since  the  giant  node  may  have  been  connected  to  the  rest  of  the  tree 
by  up  to  n  -  /  edges,  Hr  (5C )  can  take  values  1,2..  ,  n  -  Hence, 

<«  -i 

PrN*rJ5e)  *  t\  =  V  Prlii,rJSe)  =  /  wT(Se)  =  ;f 
,  - 1 

and  by  substituting  for  Pr{0/  rJSt )  =  /i0r  (5e)  =  ;}  we  obtain 
Pr{0.T  (5.)  =  /J  = 


(i -p.y  (i i1--' 


-I  J  - 1 


In  l|"  1  '  '  /'  IP.  -  P,  -  P.P,  )' 


By  substituting  j '  *  j  -  I  and  using  the  binomial  theorem  for  the  sum 
on  w e  obtain  equation  (2)  after  some  rearrangement.  This  com¬ 

pletes  the  derivation  of  equation  (2). 

Now  consider  some  speciaj  cases  that  can  be  derived  from  equ* 
cion  (2). 

•  if  only  processing  elements  (nodes)  fail,  not  links  '.re-. 
p.=  o  and  equation  ( 2 1  simplifies  to 

Pda.T  (S.)  =  I  P,  *  0:  = 


j  H-P.)'  j,  |/’.(w-i-iP.  >“  —  ket. 

•  Similarly,  if  vniv  links  not  processing  rir-ren'.v  '» 
and  equation  >2)  tiecorres 

Pr;.T  IS,  I  =  .  P,  0 


•  If  processing  eiem.nl.  *r 
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f  when  /*  0 

-Jrr  H-P)3,-‘  j")p(2-P)n-/(i-P)J|-‘-'  wh.n  /  -  i.... e 

That  special  cihi  arc  discussed  in  greater  detail  in  18*. 


1-1  remaining  live  nodes  may  be  selected  in 


n 

l_l  I  ways.  A  node 


remains  alive  if  it  neither  it  nor  the  link  connecting  it  to  the  root 
fails;  the  probability  of  this  is  y  *P,~P,  -P.P,.  If  t  is  the  root 
node,  the  probability  density  function  is  given  by 


PrlOir-.fS.)  “  I idegree( e )  -  T|)  » 


t.  Connectedness  of  a  Large  Network 

At  the  number  of  processing  elements  in  a  network  grows  large, 
how  is  the  connectedness  affected?  Do  the  still-connected  subtrees 
(giant  nodes)  grow  larger  with  the  sise  of  the  network,  or  it  there 
some  threshold  beyond  which  their  sise  levels  off?  To  investigate  this 

question,  we  will  derive  a  closed-form  expression  for  lim£*T  (5,  )j. 

•  • 

We  first  note  that 

Urn  (n-l-lxr-1-'  -  (3) 

n-» 


P.  when  1  *  0 

(l“P.)  |j_j  when  I  »  1,  2,  ,  «  ^ 

If  c  is  a  leaf  node,  1  •  1  if  node  c  does  not  fail  but  either  the  edge  con¬ 
necting  it  to  the  rest  of  the  tree  or  the  root  node  fails.  If  neither  one 
of  these  cases  occurs,  then  all  of  the  other  1-2  live  nodes  an  be 
reached  from  c;  these  1-2  nodes  are  selected  from  n-2  possibilities. 
Thus, 


By  substituting  equation  (3)  into  equation  (2).  we  can  verify  by  a  lit¬ 
tle  algebra  that 

Um  Pr|*r  (5.)  -  1}  - 


r. 


when  1  =  0 


£(i-p.),(i-p.),-,(p.*p.-p.p.)e',,"'-'"',v 
*'  when  I  -  1.2... 


)  *  lid*jre«(c)  *  }  =* 

p. 

(i -P.HP.-P.) 
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Since 


when  I  “  0 
when  I  =  I 

when  I  «  2,  3,  ,  it 


Since  V  Pr{*T_(S, ) 
1-0 


1}  mutt  equal  1,  we  obtain 


£(1  -P.)‘(t  -  P, 


(P.-P.-P.P.) 


«) 


Note  that  P.-P.-P.P.  -  1  -  (l-P.Hl-P.)  and  then  differentiate 
both  tides  of  (4)  with  respect  to  (I  -  P.|(l  -  P.).  The  derivative  of  the 
left-hand  tide  is 

'I*  .  .  ,  ii  t  r  it  i  -  p  i 

v  77  (i-iMi-p.r-ii-p.r-r 


Prjdegreefc )=  n  -  I J  =  and  Pr(degree(c)“  1)  =  — — -,  (8) 

we  can  calculate 

1  IPr^T-JS,)  =  1)  (9) 

By  substituting  equations  (6).  (7).  and  (8)  in  equation  (9)  and  using 
the  binomial  theorem  we  obtain 

£!*r-  i  =  ■^■♦(l-P.I  —  (P.*P.) 

•  n  n 


*  1  jr<‘-p.)‘  ’(i-p.i"'(-')' 


*  Vt-P.,~ti-P.-P.  -P.P.  i 


After  some  algebra,  we  derive 


Thus,  our  final  expression  is 

lim  £  *r  1 J. )’ 

-  -  • 


I-P. 

(P. -P, -P.P.)' 


I-P. 

(P. -P. -P.  P.|- 

(5) 


4.  Connectedness  of  a  Specific  Regular  Tree 

A  common  way  of  structuring  an  interconnection  network  is  as 
a  balanced  n-ary  tree.  We  will  analyse  this  structure  because  it  is 
interesting  in  its  own  right,  and  also  because  it  can  serve  at  an  exam¬ 
ple  of  how  specific  tree  structures  can  be  studied.  We  will  c  .ntider 
an  n-ary  tree  of  height  one.  The  results  can  be  generalited  to  trees 
of  arbitrary  height  by  “collapsing"  nodes,  at  explained  in  Section  2 
An  n-ary  tree  of  height  one  has  it"q-l  nodes,  all  but  one  of  which 
are  leaf  nodes.  We  will  denote  such  a  tree  by  f*,;  then 
Pr(*T-  (3.)  *  1}  •*  the  probability  that  I  nodes  remain  connected  in 
the  subtree  J,  containing  node  r  after  the  failure  of  nodes  and  edges 
in  f\.  .As  before,  we  assume  that  every  node  has  an  equal  chance  of 
being  selected  as  node  c.  We  need  analyte  only  two  cases:  where  the 
node  c  is  the  root,  or  where  it  is  a  leaf  node.  These  cases  are  illus¬ 
trated  below. 

If  c  is  the  root  nod*,  then  either  it  fails  or  it  does  not  fail.  It 
fails  with  probability  P.t  in  this  case,  i  is  0.  Otherwise,  1  >0  and  the 


♦  (|  - p. ) A"  ~2)  (|-P. -P. -p.  />. )- 

From  this  equation  we  derive 

Bm(£-*r-.|/«l  *  (1-P.)(1-P.-P.-P.P,)-  (10) 

Note  that  the  last  term  of  this  expression  tends  to  infinity  as  n  grows 
large.  Compare  the  result  for  n-ary  trees  with  the  connectedness  of 
the  average  tree  derived  in  Section  2,  which  tends  to  a  constant  for 
large  n.  This  indicates  that  bushy  n-ary  trees  provide  much  better 
than  average  reliability. 


fi.  Discussion 

Plotting  some  values  of  equations  (2)  and  (5)  gives  an  insight 
into  the  impact  of  variations  in  values  for  P.  and  P. .  Figure  1  plots 
several  curves  for  the  case  where  only  links  fail.  It  shows  how  the 
average  connected  subtree  site  varies  with  changing  P, .  Three 
curves  are  shown,  one  for  100-node  trees,  another  for  1000-node 
trees,  and  the  third  for  infinite-site  trees.  Figure  1(a)  shows  that 
there  is  a  sharp  dropoff  in  average  subtree  sit*  for  P.  >  0.:.  However, 
most  practical  networks  have  P,  ■  -  0.1,  and  the  behavior  of  th*  curve 
in  this  region  is  not  easy  to  discern  from  a  linear-scale  graph.  Ficur* 
1(b),  a  log-log  plot,  reveals  that  most  of  the  decline  in  averse*  sui- 
tree  site  occurs  (for  100-  and  1000-node  trees)  when  P,  10  ;  th* 

trees  remain  almost  wholly  connected  at  link  reliabilities  higher  than 
this. 
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It  la  also  clear  from  Figure  1  that  for  higher  values  of  n,  the 
average  subtree  site  begins  to  decline  at  lower  values  of  P,  (higher 
suitabilities).  Therefore,  to  take  best  advantage  of  large  networks, 
the  reliability  of  the  individual  links  must  be  higher  than  in  smaller 
networks. 

Figure  2  shows  that  the  phenomenon  of  node  failure  has  effects 
similar  to  those  shown  in  Figure  1(a)  for  the  case  of  link  failure. 
Figure  S  shows  the  effect  of  simultaneous  processor  and  link  failures 
on  an  infinite-node  tree.  When  both  nodes  and  edges  fail  in  an 
infinitu-sise  tree,  its  average  subtree  site  is  very  nearly  equal  to  the 
average  subtree  site  of  a  100-node  tree  where  only  nodes  or  edges 
fail.  This  result  can  be  seen  by  comparing  Figure  3  with  Figures  I 
and  2. 

The  analysis  also  implies  that,  when  a  certain  average  subtree 
site  it  needed,  it  may  be  more  effective  to  improve  processor  or  link 
reliability  than  to  increase  the  site  of  the  network,  even  by  a  huge 
amount.  For  example.  Figure  4  reveals  that  when  P.  is  0.1,  an  aver¬ 
age  subtree  site  of  25  can  be  obtained  from  a  100-node  tree  with  P, 
of  0.04  but  if  the  probability  of  edge  failure  is  increased  to  P,  =0.10, 
an  infinite  tree  is  required. 

Section  4  studied  n-ary  trees  and  established  (equation  (10)) 
that  as  ■-*,  the  average  subtree  site  increases  without  bound.  Fig¬ 
ure  5  plots  curves  for  “average'’  trees  superimposed  on  curves  for  n- 
ary  trees.  For  n-»ry  trees,  eleven  curves  are  shown,  while  for  “aver¬ 
age"  trees,  only  three  curves  are  shown  (because  the  others  would 
nearly  coincide).  It  can  be  seen  that  more  nodes  remain  connected  to 
an  arbitrary  node  in  an  n-ary  tree  than  in  an  “average"  tree;  indeed, 
for  P,  >0.1,  an  n-*ry  tree  of  modest  site  (150  nodes)  has  a  higher 
connectedness  than  even  an  infinite-site  ''average"  tree. 


hypercubes,  and  generalized  hypercubes  (the  Ai.pha  structure). 
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Introduction 

Highly  parallel  architectures  require  efficient,  high-bandwidth  networks 
interconnecting  a  number  of  processing  elements.  Loop  interconnection  structures  are  of 
interest  because  of  their  advantages  of  simplicity,  modularity,  and  expandability.  As 
VLSI  technology  permits  the  implementation  of  an  ensemble  of  processors  on  a  single 
chip,  loops  becomes  increasingly  attractive  because  they  are  inherently  planar  and  can 
be  implemented  with  a  high  degree  of  regularity.  This  paper  reports  the  results  of  a 
simulation  study  of  the  potential  performance  of  loop-connected  multiprocessor  systems. 

Loop  Operation 

While  a  variety  of  loop  configurations  have  been  proposed  as  interconnection 
structures  [lj,  this  paper  focuses  upon  a  packet  loop  as  illustrated  in  Fig.  1.  The  packet 
loop  moves  parallel  word  packets  from  the  ith  position  to  the  ith-1  position  on  each 
clock  cycle.  Packets  typically  consist  of  a  control  field,  a  source  field,  a  destination  field 
and  a  data  field.  A  processor  sends  a  packet  to  another  processor  by  writing  the  packet 
to  an  empty  slot.  Each  processor  monitors  the  loop  continuously  and  removes  any 
packet  containing  its  destination  address.  This  type  of  loop  was  initially  proposed  by 
Pierce'2!  and  has  been  incorporated  into  actual  multiprocessor  systems  3.4  .  Previous 
studies  of  loop  performance  have  focused  upon  the  loop  mainly  in  computer  network 
applications^, 6i  and  not  as  an  interconnection  structure  to  support  parallel 
computations. 


Fig.  1  Packet  Loop  Concept 


Overview  of  the  Simulation 


The  simulation  assesses  the  potential  processing  power  provided  by  an  ensemble  of 
processors  under  the  communication  constraints  imposed  by  a  particular  interconnection 
network.  Processors  are  considered  to  be  computers  with  their  own  private  memories 
that  support  multiple  processes.  Processes  communicate  with  each  other  through 
messages  that  use  the  interconnection  structure.  The  model  assumes  that  each  processor 
is  linked  to  the  interconnection  network  through  an  input  and  an  output  queue.  When 
data  arrives  from  the  network,  it  remains  in  the  input  queue  until  required  by  the 
processor.  When  a  processor  produces  a  data  item  to  send  to  another  processor,  it  is 
placed  in  the  output  queue  until  the  network  can  accept  it. 

A  task  is  modeled  as  a  period  of  time  during  which  five  types  of  activity  might 
occur  as  illustrated  below: 


E 

Io 

The  symbols  are  defined  as: 

Wt  -time  spent  waiting  on  arrival  of  needed  input 
/,  -time  spent  accepting  input  from  the  input  queue 
E  -actual  processing  time 

I0  -time  spent  writing  results  to  the  output  queue 

W0  -time  spent  waiting  for  access  to  the  interconnection  network 

To  model  interdependent  processes,  the  simulation  sends  the  output  of  a  producer 
process  to  the  next  consumer  process  that  is  awaiting  data.  Should  no  process  need  data, 
a  recipient  is  chosen  at  random. 

Simulation  Results 

The  measure  of  the  efficacy  of  the  interconnection  structure  produced  by  the 
simulation  is  processing  power.  Processing  power  is  simply  the  average  number  of 
processors  that  are  in  the  E  state.  To  establish  a  baseline  for  comparative  study,  exactly 
the  same  processing  tasks  were  simulated  for  both  a  demand  multiplexed  bus  and  a 
packet  loop.  The  average  of  the  simulation  output  for  several  sets  of  task  data  is  shown 
in  Fig.  2.  Here  the  execution  time  is  uniformly  distributed  with  a  mean  of  20  bus  cycles 
and  the  input  and  output  times  are  25%  of  execution  time.  The  loop  performance  in  this 
case  is  slightly  inferior  to  the  bus  until  12  processors  are  employed  and  is  only  marginally 
better  than  the  bus  when  12  or  more  processors  are  employed. 

On  the  surface,  the  results  shown  in  Fig.  2  would  seem  to  indicate  that  a  loop 
interconnection  structure  offers  little  advantage  over  a  conventional  demand  multiplexed 
bus.  The  performance  of  a  multiprocessor  ensemble  using  the  bus  depends  upon  the 
number  of  processors,  the  fraction  of  time  the  bus  is  needed  and  the  bus  bandwidth. 
Willi  a  loop  interconnection  structure,  however,  the  physical  arrangement  of  t  he 
processor-  on  the  loop  relative  to  the  communication  needs  of  the  individual  proces-e- 
becomes  important.  A  loop  connected  system  will  exhibit  its  best  performance  when 
each  processor  sends  its  results  to  its  immediate  down  stream  neighbor.  The  worst  case 
for  the  loop  results  when  each  processor  is  required  to  send  its  results  completely  around 
the  loop  to  its  immediate  up  stream  neighbor.  Figure  3  shows  the  results  of  the 
simulation  for  both  the  best  and  worst  cases  of  the  loop  as  compared  with  the  bus. 
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Fig.2  Typical  Bus  and  Loop  Perforn 
(I/O  Time=25%  of  Execution  Tin 


oi  execution  Time) 


Conclusion 

One  cannot  state  categorically  that  a  loop  interconnection  structure  performs 
better  than  a  demand  multiplexed  bus.  With  a  judicious  choice  in  binding  of  processes 
to  processors,  the  performance  of  the  loop-connected  system  can  grow  almost  linearly 
without  bound.  However,  a  poor  choice  in  the  assignment  of  processes  to  processors  in  a 
loop  connected  system  can  result  in  performance  that  is  even  worse  than  the  bus  in 
saturation.  This  study  indicates  that  the  loop  is  most  attractive  in  applications  in  which 
traffic  patterns  are  known  and  can  be  synchronized. 
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EXTENDED  SUMMARY 

In  a  distributed  computing  system,  a  job  is  usually  decomposed  into  several 
cooperating  tasks  and  then  assigned  to  processors  in  order  to  exploit  the  inherent  paral¬ 
lelism  in  job  execution.  The  distributed  computing  system  and  cooperating  tasks  can 
usually  be  represented  by  a  pr  occts  or  graph,  GP  =  (VP,  EP),  and  a  taak  graph  ,  GT  == 
(VT,  E-r),  respectively.  Here  VT  is  the  set  of  nodes,  each  representing  a  task  of  the  job, 
and  E*j>  C  VT  X  VT  is  the  set  of  edges,  each  representing  intertask  communications 
between  the  two  nodes  adjacent  to  this  edge,  and  VP  is  the  set  of  processors  in  the  sys¬ 
tem  and  EP  C  VP  X  VP  the  set  of  edges  representing  communication  links  between  pro¬ 
cessors.  Any  two  adjacent  nodes  in  the  task  graph  indicate  the  existence  of  direct  com¬ 
munications  between  them.  (Such  tasks  are  said  to  be  related  to  each  other  ) 

Any  two  related  tasks  are  required  to  be  assigned  to  the  same  processor  or  two 
adjacent  processors.  Henceforth  this  will  be  termed  the  adjacency  requirement.  Using 
some  sort  of  criterion  (e.g.,  load  balancing),  the  problem  of  deriving  an  optimal  task 
assignment  is  usually  formulated  as  a  nonlinear  integer  programming  problem.  There 
are  three  common  approaches  to  the  problem.  0-1  integer  programming,  use  of  graph 
theory,  and  heuristics  [l,  2). 

The  task  assignment  problem  is  known  to  be  NP-complete  (3,  4).  In  [l],  the  task 
assignment  problem  is  formulated  as  a  state-space  search  problem  and  then  solved  by 
the  heuristic  algorithm  A*.  However,  without  the  knowledge  of  the  site  of  the  state- 
space,  one  cannot  tell  the  maximum  number  of  expanded  nodes  in  the  A*  algorithm;  its 
complexity  is  difficult  to  analyze.  Moreover,  the  knowledge  of  the  number  of  acceptable 
assignments  can  be  applied  in  the  state-space  search.  A  search  method  with  this 
knowledge  —  although  it  may  reach  a  suboptimal  goal  node  instead  of  the  optimal  one  -- 
requires  much  less  computation  cost  in  the  state-space  search  and,  thus,  provides  a  use¬ 
ful  insight  into  the  state-space  search  problem. 

In  addition,  the  knowledge  of  the  number  of  acceptable  assignments  and  its  relation 
with  the  processor  and  task  graphs  can  play  an  important  role  m  the  design  of  a  distri¬ 
buted  computing  system,  when  the  system’s  admissibility  of  incoming  tasks  is  a  major 
consideration.  In  other  words,  one  needs  to  determine  the  system's  structure  which 

'The  work  reported  in  this  paper  was  partially  supported  by  Office  of  Nava!  Research,  ('ontract  \c 
N000I4-8.S  K-012C 


allows  as  many  acceptable  assignments  as  possible  for  a  given  set  of  cooperating  tasks. 
This  will  in  turn  increase  the  probability  that  more  number  of  incoming  tasks  can  be 
assigned  to  the  system  while  satisfying  the  adjacency  requirement.  This  knowledge  can 
also  be  used  as  a  measure  of  importance  of  each  edge  in  Gp,  i.e.,  importance  of  interpro¬ 
cessor  communication  links. 

This  work  is  concerned  with  the  determination  of  the  number  of  acceptable  task- 
assignments,  denoted  by  N(Gp,  G«j>),  rather  than  task  assignments  themselves.  Specifi¬ 
cally,  the  bounds  of  N(GP,  G-p)  are  derived  when  the  task  and  processor  graphs  are  arbi¬ 
trary,  and  then  a  recursive  formula  for  the  N(Gp,  GT)  when  the  task  graph  is  restricted 
to  a  tree.  If  GP  =  Rn  or  Gp  =  Qn  in  addition  to  the  assumption  of  GT  =  a  tree,  one 
can  derive  a  non-recursive  closed  form  for  N(Gp,  GT).  These  special  cases  hold  practical 
importance,  since,  for  example,  hypercube  multiprocessors  are  becoming  wide-spread  in 
the  supercoming  arena. 

The  knowledge  of  N(Gp,  GT)  is  applied  to  (i)  the  problem  of  determining  the  inter¬ 
processor  communication  link  which  is  of  the  utmost  importance  to  the  system  admissi¬ 
bility  of  incoming  tasks,  and  (ii)  the  state-space  search  of  the  task  assignment  problem 
where  the  search  is  guided  toward  an  ampler  state-space  (i.e.  with  a  larger  N(GP,  GT)) 
without  computing  the  associated  heuristic  function  for  every  possible  case.  The  appli¬ 
cation  approach  (ii)  is  based  on  the  observation  that  a  larger  number  of  assignments 
implies  that  unassigned  tasks  have  a  better  chance  to  be  spread  out  in  the  system. 
When  the  goal  of  load  balancing  is  more  important  than  that  of  reducing  interprocessor 
communications,  the  likelihood  of  making  a  successful  guess  with  the  approach  (ii)  will 
certainly  increase. 
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1.  INTRODUCTION 

A  distributed  system  is  a  combination  of  a  distributed  architecture  and  distributed  control  algorithms  Distributed 
architectures  cover  a  wide  range  of  architectures  ranging  from  loosely  coupled  multiprocessors  to  long  haul  networks 
In  this  paper  we  consider  distributed  architectures  as  heterogeneous  local  area  networks  (LAN)  Such  an  environment 
consists  of  a  variety  of  computing  resources,  such  as  diskless  workstations,  workstations,  mainframes,  vector  proces¬ 
sors,  and  multiprocessors,  connected  by  a  high  speed  (relatively)  reliable  communication  medium  Distributed  control 
algorithms  manage  the  distributed  architecture  in  order  to  realize  the  potential  advantages  of  the  distributed  archi¬ 
tectures  availability,  reliability,  extensibility,  and  flexibility 

We  are  currently  investigating  distributed  control  algorithms  that  provide  a  reliable  unified  service  environment 
(REUSE)  Such  an  environment  presents  the  distributed  system  as  a  unified  computing  resource  which  can  provide 
services  in  a  location  independent  manner  In  this  paper  we  discuss  some  of  the  concepts  that  we  are  using  in  design¬ 
ing  the  REUSE  system 

2.  CLIENT/SERVER  GROUP  MODEL 

The  primary  form  of  interaction  in  the  REUSE  system  is  for  clients  to  request  that  a  well-known  service  he  per¬ 
formed  The  requested  service  will  be  performed  by  a  particular  server  process  In  order  to  provide  reliable  service  we 
will  encapsulate  all  server  processes  that  provide  a  particular  service  into  a  server  group  jChZw85]  Each  server  group 
will  consist  of  a  set  of  cooperating  servers  that  manage  a  set  of  resources  In  general  each  server  in  the  group  will 
manage  a  subset  of  the  server  group’s  resources  that  are  similar  to  the  resources  managed  by  the  other  servers  in  the 
group  The  redundancy  of  resources  enables  the  server  group  to  provide  more  reliable  service 

Users  will  request  services  through  predefined  service  names  Service  names  will  be  mapped  into  group  IDs  Client 
processes  will  interact  with  a  server  group  by  communicating  with  a  local  agent  (using  the  process  ID)  of  the  group  or 
by  communicating  with  the  group  as  a  whole  (using  the  group  ID)  Some  possible  types  of  server  groups  include 
scheduling  agent  group,  file  management  group,  mail/ communication  server  group,  and  printer  server  group  A 
scheduling  agent  group  controls  the  scheduling  of  processes  at  different  hosts  in  the  system  A  file  management  group 
will  interpret  file  names  and  manage  files  storage  devices  A  mail/commumcation  server  g  oup  will  do  mailing 
address  interpretation,  routing,  mailbox  management,  and  manage  any  gateways  connected  to  the  system  The 
printer  server  group  will  manage  the  different  printing  devices  on  the  system 

As  a  further  illustration  of  server  groups  consider  the  scheduling  agent  group  Each  host  in  the  system  will  have  a 
local  scheduling  agent  which  allocates  the  processors  at  that  host  to  the  ready  processes  The  local  scheduling  agents 
form  the  scheduling  agent  group  These  scheduling  agents  cooperate  in  order  to  implement  some  scheduling  polity 
Some  possible  features  of  the  policy  could  include,  load  balancing  considerations,  fairness  criteria  and  appropriate 
matching  of  jobs  and  host  architectures  Scheduling  policies  may  require  job  migration  NiXC.ga  In  order  to  imple¬ 
ment  some  policies,  client  processes  will  have  to  provide  the  scheduling  agent  with  some  characteristics  of  'lieu  jobs 
As  a  simple  example  consider  the  case  of  the  server  processes  of  a  hie  management  gp'Up  <  hm  >  liar  c  'ei  is:  r<  !  '  .v-.- 
processes  is  that  they  should  not  be  migrated  because  their  responsibility  as  either  a  teal  hie  -eivu  agen’  i  i.-  id- 
..erver  managing  local  devices  (or  both  agent,  and  server)  is  location  dependent  <  >t  her  than-  'eti.-tjis  ■  •  -ni-i  iu  :>■ 
expected  run  tame  and  the  type  of  jobs  (vectorized  or  parallelized) 
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3.  RELIABLE  INTERPROCESS  COMMUNICATION 

The  foundation  of  a  distributed  programming  environment  is  a  set  of  flexible  and  reliable  interprocess  communication 
(IPC)  primitives.  The  general  function  of  IPC  is  to  create  a  set  of  mechanisms  that  allow  location-independent 
exchange  of  information  between  processes  in  the  distributed  system  An  IPC  mechanism  consists  of  various  types  of 
send  and  receive  operations  Various  send  and  receive  operations  should  be  provided  so  that  the  programmer  may 
exploit  potential  concurrency  in  an  algorithm  and  synchronize  the  communicating  processes  when  it  is  necessary  The 
following  table  lists  ten  IPC  commands  that  we  propose  Detailed  definitions  and  usage  of  these  IPC  commands  can 
be  found  in  [NiGe86]  and  will  not  be  elaborated  here 


IPC  Primitive  Type 

IPC  Command 

asynchronous  one-to-one  send 

Send 

synchronous  one-to-one  send 

Send_and_\Araut 

asynchronous  one-to-many  send 

Multicast 

synchronous  one-to-many  send 

Multicast_and_\Vait 

blocking  selective  receive 

Receive 

blocking  non-selective  receive 

Receive_Anv 

non-blocking  selective  receive 

Cond_Receive 

non-blocking  non_selective  receive 

Cond_Receive_Anv 

non-blocking  reply 

Reply 

blocking  reply 

Replv_and_Wait 

Most  of  the  existing  distributed  programming  languages  or  known  IPC  mechanisms  support  one-to-one  interprocess 
communication  However,  in  many  distributed  applications  a  group  of  processes  are  coordinated  to  solve  a  single 
task  These  coordinated  processes  form  a  process  group  [ChZw85,  GeN'i85i  and  may  reside  in  various  physical  proces¬ 
sors  A  process  in  a  process  group  may  frequently  wish  to  broadcast  a  message  to  those  processes  in  the  same  group 
Thus,  a  one-to-many  send  primitive  is  necessary  to  allow  the  message  to  be  sent  to  a  group  of  processes  rather  than  a 
single  process  The  following  group  primitives  are  proposed  to  manipulate  groups  ,NiGe86j 

C'reate_Group(group_id,  processed,  group_type,  group_structure,  jom_method,  status) 
Destroy_Group(group_id,  processed,  kill_processes,  status) 

Jom(group_id,  processed,  timeout  status) 

Invit.e(group_id,  host,_id,  guest_id.  timeout,  status) 

Leave) group_id,  processed,  status) 

Remove(group_id,  processed.  victim_id,  status) 


4.  DISTRIBUTED  ALGORITHM  DEVELOPMENT 

A  distributed  algorithm  involves  a  number  of  processes  coordinated  to  solve  a  single  task  The  cooperation  of  servers 
in  a  server  group  is  a  special  case  of  distributed  algorithms  Distributed  algorithms  have  three  important  features 
negotiation,  remote  state  maintenance,  and  synchronization  |GeNi85!  A  negotiation  is  a  set  of  actions  through  which 
two  or  more  processes  attempt  to  reach  an  agreement  The  agreement  may  be  an  agreement  on  a  shared  value  or  it 
may  be  an  agreement  to  perform  some  actions  A  negotiation  can  take  two  general  forms  coordinated  and  peer  In 
the  case  of  a  coordinated  negotiation,  one  of  the  negotiators  is  called  a  coordinator  and  the  other  negotiators  are 
called  respondents  During  the  negotiation  phase  the  coordinator  provides  a  central  point  of  control  In  .1  peer  nego¬ 
tiation  all  negotiators  (processes)  are  equal  There  is  not  a  centralized  point  of  control  and  n  •  negotiate!  ha.-  1  t:,- 

plete  view  of  the  negotiation  This  is  a  more  distributed  form  of  negotiation  and  a  form  whu  li  is  important  in  distri¬ 
buted  algorithms  An  example  of  a  peer  negotiation  is  an  election  algorithm 

Another  important  feature  of  many  distributed  nig-  nihms  is  reeve  -t.,tc  ..  Tv- 

v  p"i  it  mg  must  have  some  kin -wT-dgM  .  .f  e.n  h  ■  leu-  -t  at,-  |tl  1  ••  ;  ’  v  v  . 

their  current  st..*»  to  some  of  the  other  processes  in  ’he  algorithm  I’  ;s  mu-  j Tan*  •  :i  >' 

quently  information  is  exchanged  between  the  pres  esses  ea<  h  pro*  ess  will  have  ,  -light iv  ;  1  if •  t 
The  primary  problems  .issoi  luted  with  remote  state  m. tinted  am  e  ire  'he  fre.juen,  v  f  v  -•  , 

1  m-  amt  f  information  that  should  be  reported  A  f  1  h!<-  tf  exists  between  the  •  ■  !  •im-m;-- 

’lie  v  line  of  m-. ip  ac  curate  information  The  alg  iithm  designer  me.-’  : - i*-:i * . f v  -.- -vt  1  1.  -• 
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changes  to  that  information  that  are  significant  to  the  remote  processes 

Synchronisation  issues  arise  in  the  areas  of  initiation  of  distributed  programs,  termination  of  distributed  programs 
ordering  of  processes,  and  controlling  access  to  shared  values  In  initiating  a  distributed  algorithm,  all  processes  in 
the  algorithm  must  have  identified  themselves  as  ready  to  begin  Problems  arise  in  deciding  how  the  processes  should 
inform  the  others  that  they  are  ready  and  how  to  decide  that  all  processes  in  the  algorithm  are  ready  In  distributed 
termination,  the  primary  problem  is  to  determine  whether  all  processes  have  completed  a  specific  phase  of  the  algo¬ 
rithm  Ordering  of  processes  is  concerned  with  identifying  that  the  actions  which  must  be  completed  before  a  process 
can  start  actually  have  been  completed  Access  to  shared  values  must  be  controlled  so  that  updates  to  those  values 
are  completed  in  an  atomic  manner  This  problem  is  further  complicated  in  a  distributed  system  because  the  shared 
values  may  be  replicated 

Users  may  also  wish  to  create  distributed  algorithms  to  take  advantage  of  the  distributed  system  In  order  to  do  this 
the  user  will  create  a  program  which  will  consist  of  many  processes  that  will  form  a  application  proress  group  The 
application  process  group  has  to  negotiate  with  the  scheduling  agent  group  in  order  to  get  some  processor  resources 
The  application  process  group  could  specify  the  type  and  number  of  processor  resources  that  it  requires  f exactly  or  a 
range  of  minimum  to  maximum  requirements)  The  scheduling  agent  group  would  take  into  account  the  current  load 
of  the  system,  the  number  of  pending  requests,  and  the  priority  of  the  user  and  respond  with  the  actual  resources  it  is 
willing  to  give  to  the  user  The  application  process  group  could  then  configure  itself  according  to  the  available 
resources  it  was  granted.  A  more  dynamic  system  might  allow  the  application  process  group  to  negotiate  with  the 
scheduling  agent  group  throughout  the  execution  of  the  application  program 

5.  RELIABLE  SERVER  GROUPS 

An  important  consideration  in  the  construction  of  server  groups  is  the  investigation  of  the  tools  needed  to  provide 
reliable  service  By  reliable  service  we  mean  that  the  server  groups  can  gracefully  handle  the  loss  of  resources  and 
servers  This  requires  coordinated  replication  of  information  that  is  essential  to  continued  (and  accurate)  performance 
of  service  by  the  server  group  The  servers  in  the  server  group  must  cooperate  in  order  to  identify  the  loss  of  servers 
and  to  do  any  reconfiguration  required  for  continued  service 

Different  types  of  reliability  are  required  depending  on  the  sophistication  of  the  clients  For  example  the  file  servers  of 
a  file  management  group  do  not  expect  the  scheduling  agent  group  to  provide  reliable  execution  of  the  server 
processes  It  is  the  responsibility  of  the  file  mangement  group  to  handle  the  loss  of  servers  due  to  proi  essor  failures 
The  file  management  server  is  considered  a  sophisticated  client  whose  reliability  requirements  are  beyond  the  scope  of 
the  scheduling  agent  group  On  the  other  hand  the  scheduling  agent  group  should  take  some  responsibility  for  the 
reliable  execution  of  a  user’s  job  For  example  if  local  scheduling  agent  migrates  a  job  to  a  remote  processor  it  should 
periodically  communicate  with  that  processor  to  make  sure  that  the  user  process  is  still  being  serviced  If  the  remote 
processor  fails,  the  local  scheduling  agent  should  try  to  restart  the  process  at  another  processor  In  order  to 
differentiate  different  classes  of  clients,  the  client  will  have  to  make  their  characteristics  and  requirements  known  to 
the  scheduling  agent  group 
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A  Virtual  Memory 


We  are  interested  in  the  potential  of  tightly-coupled  multiprocessors  to  provide 
general-purpose  computing  services  in  an  economical  and  readily  expanded  fashion. 
The  computer  system  is  viewed  as  a  pool  of  processors  each  able  to  access  any  of 
a  pool  of  memory  units.  The  load  is  seen  as  a  changing  population  of  processes, 
normally  more  numerous  than  the  pool  of  processors  (P.U.'s).  Some  processes  may 
cooperate,  sharing  subsets  of  their  data,  others  are  completely  independant  of 
one  another.  No  explicit  committment  of  P.U.'s  to  processes  should  be  visible 
to  users,  any  more  than  the  committment  of  physical  memory  to  data  is  visible  in 
a  virtual  memory  system. 

To  allow  arbitrary  cooperation  among  processes  and  indefinite  expansion  of  the 
system,  a  uniform,  very  large  virtual  memory  is  defined.  To  allow  secure 
protection  of  data  and  processes  from  unauthorized  interference,  access  within 
the  virtual  memory  is  controlled  by  a  capability  system.  The  virtual  memory 
contains  uniquely-named  objects,  each  being  either  a  data  set  or  a  process,  and 
knowledge  of  a  valid  capability  permits  some  defined  type  of  access  to  some 
(subset  of  an)  object.  Because  we  aim  to  design  a  system  which  can  transparently 
be  expanded  to  encompass  networks  of  similar  multiprocessor  systems,  the  naming  of 
objects  is  universally  unique.  Each  name  includes  a  volume  serial  number  normally 
identifying  a  disc  volume  and  an  object  serial  number  large  enough  not  to  require 
reuse  within  the  life  of  the  volume.  Objects  cannot  be  moved  from  volume  to 
volume,  but  volumes  may  be  physically  moved  from  system  to  system  without  loss  of 
the  identity  of  their  objects. 

Basic  protection  against  forgery  of  capabilities  cannot  easily  be  arranged  by 
tagging  or  segregation  of  capabilities  from  other  data,  since  we  see  Lt  as  desirabl 
that  data  objects  containing  capabilities  be  accessible  via  communication  links 
not  necessarily  guaranteed  to  observe  the  system's  protection  protocols,  and  by 
any  agent,  human  or  computer,  which  can  exhibit  a  valid  capability.  Capabilities 
are  therefore  protected  by  including  in  them  randomly-chosen  64-bit  "passwords". 
Before  any  capability  is  honoured,  it  is  checked  against  a  catalogue  of  valid 
capabilities  held  in  or  associated  with  the  volume  concerned.  Since  such  a  check 
is  always  required,  there  seemed  little  point  in  attempting  to  encrypt  access 
rights  within  the  capability  itself.  Instead,  these  are  held  in  the  catalogue. 
There  may  exist  many  capabilities  for  an  object  conferring  different  combinations 
of  access  rights.  However,  since  capabilities  are  just  binary  values  which  may  be 
freely  copied  and/or  transmitted  to  trusted  associates,  in  practice  the  number  of 
distinct  capabilities  simultaneously  extant  for  an  object  is  expected  to  be  small. 

It  may  be  thought  that  the  need  to  consult  a  catalogue  before  honouring  any 
capability  would  impose  a  heavy  overhead.  However,  in  any  virtual  memory  system, 
system  tables  must  be  consulted  before  honouring  any  virtual  address  if  only  to 
find  the  current  physical  location  of  the  addressed  information.  As  we  hold 
catalogue  entries  of  current  interest  in  a  memory  cache  area  associated  with  page 
table  information,  no  great  additional  overhead  is  imposed. 

Window  Tables 

Clearly,  one  cannot  require  rechecking  of  a  capability  every  time  a  process  wishes 
to  read,  write  or  execute  some  word  of  an  object.  Processes  are  therefore  allowed 
to  validate,  and  thereafter  hold  in  a  "window  table",  a  number  of  capabilities. 

With  each  such  window  table  entry  is  held  location  information  allowing  direct 
access  to  the  data  addressed  by  the  capability.  Processor  hardware  support  for 


the  window  table  then  allows  access  to  the  addressed  data  to  be  made  without 
further  reference  to  catalogue  or  physical  location  tables.  Processor 
instructions  typically  specify  a  window  table  entry  number  and  a  byte  offset 
within  the  (part  of  the)  object  defined  by  the  loaded  capability.  To  allow  revocation 
of  capabilities,  all  window  table  entries  are  periodically  revalidated. 

The  Intermediate  Address  Space 

Ideally,  when  a  capability  is  loaded  into  a  window  table,  the  process  should  be 
given  the  information  required  physically  to  locate  and  access  any  datum  within 
the  range  of  the  capability.  More  realistically,  a  few  page  locations  might  be 
cached  within  the  processor.  While  feasible,  it  would  be  difficult  to  maintain 
the  currency  of  all  the  page  location  information  held  in  all  the  processors, 
especially  information  relating  to  pages  in  other  multiprocessor  systems.  Instead, 
every  object  of  current  interest  in  a  multiprocessor  is  mapped  into  a  contiguous 
block  of  a  single,  large  intermediate  address  space,  and  processors  retain  in 
their  window  tables  the  intermediate  base  and  limit  addresses  defined  by  loaded 
capabilities.  A  logical  address  specified  by  a  window  table  entry  number  and 
offset  is  translated  to  an  intermediate  address  which  is  sent  to  the  memory 
modules.  The  memory  modules  themselves  map  intermediate  addresses  into  physical 
page  locations. 

As  the  intermediate  address  space  is  large  (2**34  bytes  in  the  present  ptototype) 
objects  can  retain  their  intermediate  addresses  for  long  periods,  of  the  order  of 
hours.  Physical  page  locations,  which  are  much  more  volatile,  are  held  only  in 
the  hash-addressed  maps  within  each  memory  module,  and  are  not  duplicated  in  many 
caches . 

Each  multiprocessor  has  its  own  intermediate  space.  References  by  a  process  in 
one  system  to  objects  in  another  are  mapped  from  one  intermediate  space  to  the 
other.  To  allow  physically  neighbouring  systems  to  communicate  quickly,  we  have 
designed  an  intersystem  communication  unit  which  performs  this  mapping  at  high 
speed.  It  also  provides  a  protection  boundary  between  communicating  multiprocessor 
systems,  because  one  system  may  establish  a  mapping  of  part  of  its  intermediate 
space  into  an  object  in  the  other  system’s  space  only  by  presenting  a  valid 
capability  for  the  object  to  the  other  system.  For  accesses  from  system  A  to 
objects  in  system  B,  the  unit  appears  to  A  as  a  memory  module  accepting  certain 
intermediate  addresses.  Within  the  unit,  these  addresses  are  remapped  using  a 
window  table  containing  capabilities  for  objects  in  B.  Thus  the  unit  appears  to 
B  as  a  processor.  Physically,  intersystem  units  are  constructed  as  modules,  one  of 
which  is  connected  to  each  multiprocessor  system.  Each  module  can  appear  as  a 
memory,  to  accept  outgoing  access  requests,  and  as  a  P.U.,  to  effect  accesses  on 
behalf  of  other  systems. 

Communication  among  physically  distant  systems  would  require  use  of  a  conventional 
network.  As  such  communications  are  necessarily  slow  relative  to  memory  speeds, 
we  plan  to  specify  the  required  accesses  by  explicit  capability/offset  pairs  rather 
than  try  to  maintain  any  intermediate  space-to-intermediate  space  mapping. 

The  existence,  size  and  values  of  intermediate  addresses  are  invisible  to 
processes. 


irocess  Communication 


The  capability-addressed  vertical  memory  allows  sharing  of  data  land  of  course 
code)  objects  among  many  processes.  In  addition,  a  simple  message  scheme  is 
provided.  Every  process  contains  a  mailbox  into  which  other  processes  may  place 
messages,  provided  they  have  an  appropriate  capability.  Message  sending  is  defined 
as  a  primitive  operation  on  the  virtual  memory,  and  direct  modification  or 
inspection  of  mailboxes  is  not  allowed.  Each  mailbox  is  of  finite  size,  so 
processes  are  expected  to  inspect  and  remove  incoming  messages  periodically. 


For  synchronization  purposes,  processes  may  voluntarily  suspend  until  a  message 
is  received  or  a  timeout  expires.  Arrival  of  a  message  normally  does  not 
interrupt  a  process.  Indeed,  we  provide  no  equivalent  of  conventional  interrupts. 
If  a  process  anticipates  the  need  to  take  prompt  unscheduled  action  on  receipt  of 
a  message,  it  may  create  another  process  to  wait  for  such  a  message.  However,  a 
process  with  appropriate  capability  can  forcibly  suspend  another  process,  at  least 
at  the  expiry  of  the  latter's  current  time  slice. 

Since  interprocess  communication  is  oriented  to  processes  rather  than  P.U. 's,  and 
is  not  defined  to  cause  interruption,  no  physical  communication  path  among  P.U.'s 
is  required  for  this  purpose. 


Every  P.U.  must  be  able  to  access  every  memory  module.  A  time-shared  bus  is  a 
convenient,  if  limited-bandwidth,  means.  To  minimise  traffic  on  the  bus,  this 
design  like  most  similar  multiprocessors  relies  on  caches  for  instructions  and  data 
within  each  P.U.  If  shared,  mutable  data  is  held  in  these  caches,  rather  elaborate 
schemes  must  be  used  to  maintain  consistency.  Physically,  such  schemes  require 
either  inter-PU  communication  paths  or  broadcast  signal  paths  to  all  P.U.'s  or  both, 
and  may  require  every  PU  to  monitor  the  memory  traffic  from  other  P.U.'s.  While 
all  these  requirements  can  be  met  in  a  bus-coupled  system,  they  result  in  a  rather 
complex  bus  protocol,  and  may  limit  the  bus  cycle  time  to  the  cycle  time  of  the 
P.U.  cache  memory. 

We  are  skeptical  of  the  need  to  support  a  large  PU  access  rate  to  shared  mutable 
data,  and  do  not  require  inter-PU  communication  for  interrupt  purposes.  We  have 
therefore  chosen  to  allow  shared  mutable  data  to  be  excluded  from  PU  caches,  and 
by  so  simplifying  the  bus  protocol,  allow  a  higher  bus  bandwidth.  The  present 
prototype  uses  a  34-port  32-bit  25  ns  synchronous  bus,  but  larger  systems  of  this 
architecture  could  use  dual  unidirectional  buses,  since  traffic  is  always  PU-to- 
memory  or  memory-to-PU.  The  bus  technique  used  has  been  shown  capable  of 
unidirectional  operation  at  12  ns  cycle  speed.  Further,  since  no  PU  or  memory 
requires  access  to  traffic  of  other  P.U.'s  or  memories,  multiple  disjoint  buses 
or  Omega  networks  could  be  used. 

I/O  Processes 

Disc  storage  for  objects  is  provided  via  disc  controllers  which  access  memory 
modules  in  the  same  way  as  P.U.  's.  Since  the  demand  paging  scheme  covers  all 
objects,  including  the  analogues  of  conventional  files,  no  explicit  disc  access 
action  is  defined  at  the  kernel  level.  Other  peripherals  are  controlled  by 
dedicated  processes  running  permanently  on  dedicated  P.U.'s,  and  user-level  processes 
effect  I/O  by  communication  with  these  I/O  processes.  The  dedicated  P.U.'s  are 
physically  attached  to  a  lower-speed  bus  connected  to  the  main  bus  by  an  adaptor, 
but  share  the  same  virtual  and  intermediate  spaces  seen  by  other  P.U.'s.  They  also 
have  private  local  memory,  and  can  offer  services  such  as  editing  and  command  line 
interpretation. 

Status 

A  prototype  using  NS32032-based  P.U.'s  has  been  operating  since  January  1986.  Most 
features  of  the  virtual  memory  are  implemented.  As  the  P.U.'s  are  too  slow  to  use 
etfectivelv  the  available  bus  and  memory  bandwidth,  bit-slice  RISC  P.U.'s  are  being 
designed.  A  Vax  11/750  is  at  present  used  as  an  I/O  processor.  Fuller  details  of 
the  virtual  memory  interface  have  appeared  in  Anderson,  Pose  and  Wallace,  'A 
Password  Capability  System',  Computer  Journal,  Vol.  29,  No.  1,  pp  1-8,  1986.  The 
bus  design  is  described  in  Wallace  and  Koch,  'ATTL  Compatible  Multiport  Bus', 

Computer  System  Science  &  Engineering,  _1_,  I,  Oct.  1985. 


DIRECTIONS  IN  COMPUTER  GRAPHICS  ARCHITECTURE 


John  Staudhammer 
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In  the  last  ten  years  Computer  Graphics  has  become  a  large  part  of 
every  major  computer  application  package.  Today's  graphics  are  predomi¬ 
nantly  high  resolution  raster  display  oriented  [1.]  The  relatively 
common  high  resolution  Ik  x  Ik  60  Hz  refresh  display  needs  a  pixel 
access  time  of  about  15  ns.  The  High  Definition  TV  systems  now  being 
tested  in  Japan  have  a  1500  line  resolution;  thus  we  may  expect  that  2k 
color  raster  displays  will  become  commonplace  in  the  near  future.  The 
size  of  the  displays  has  also  increased:  currently  35  inch  tubes  are 
available.  Much  progress  has  been  made  in  understanding  display 
algorithms  and  the  generation  of  finely  detailed  'realistic*  images. 
One  can  expect  further  development  in  these  areas.  Computer  graphics 
is  used  for  visualization  of  designs,  of  spatial  relationships,  of 
feature  dependencies,  of  understanding  multi-valued  abstract  inter¬ 
actions.  However,  most  of  the  work  on  image  rendering  has  been  directed 
towards  the  making  of  visually  pleasing,  publishable  static  images. 
These  are  usually  made  up  in  large  image  buffer  memories.  Two  problems 
are  associated  with  this  scheme:  the  image  display  rate  and  the  image 
update  rate. 

The  image  display  rate  is  largely  fixed  for  directly  viewed  pictures. 
Depending  on  the  ambient  light  and  image  brightness,  the  image  needs  to 
be  repeated  once  every  30  to  15  milliseconds  [2];  the  60/second  update 
rate  now  used  is  likely  to  remain  an  acceptable  'standard'.  The  image 
update  rate  will  be  a  compromise  between  affordable  computer  costs  and 
needs  for  visualization.  The  beauty  and  clarity  of  a  display  image  is 
not  the  final  arbiter  for  the  usefulness  of  that  image  (although  so 
judged  now  from  published  still  images.)  When  a  person  needs  to  know 
the  nature  of  an  object,  a  great  many  views  are  required  to  form  an 
understanding.  These  views  must  be  done  in  a  short  time;  a  task  not 
possible  with  static  image  generators.  Typically  the  image  is  computed 
using  a  memory  in  a  general-purpose  machine  and  is  then  transferred  to 
the  display  generator.  In  addition  to  the  computational  requirements,  a 
data  transfer  problem  must  be  faced. 

In  visualizing  three-dimensional  designs,  such  as  mechanical  assemblies, 
architectural  entities,  naval  and  aerospace  vehicles,  a  simple,  uniform 
colored  background,  a  simple  light  reflection  algorithm  and  relatively 
crude  surface  approximations  are  usually  adequate.  Studies  of  the  image 
data  structure  of  a  raster  scan  representations  of  such  scenes  show 
three  main  features  [6]: 

1.  Long  strings  of  uniform  colors,  mainly  the  background; 

2.  Shorter  strings  of  colors  which  represent  the  main  features  of  the 

design  object; 

3.  Many  very  short  strings,  most  only  a  single  dot,  which  give  the 

details  of  the  surface,  the  minor  features. 
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Taking  advantage  of  these  image  characteristics  one  can  build  relatively 
inexpensive  encoding  and  decoding  machines  which  will  reduce  the  amount 
of  data  required  for  the  description  of  acceptable  images  by  large 
factors;  a  data-compression  of  10  to  15  is  achievable.  Not  only  is  the 
image  buffer  reduction  significant,  but  more  importantly  it  is  possible 
to  maintain  an  image  flicker  -  free  without  recourse  to  prodigious 
computer  I/O  rates.  Thus  robot  manipulator  actions  can  be  viewed  using 
only  normal  input/output  channel  speeds  on  machines  such  as  VAX 
computers.  Further  reductions  in  image  storage  volumes  and  consequent 
amelioration  in  data  transfer  rates  can  be  achieved  by  considering  the 
scan-line  to  scan-line  coherence,  but  more  processing  will  be  involved. 

Various  simple  data  compaction  schemes  exist  which  are  particularly 
useful  for  managing  images  arising  in  mechanical  design.  Some  of  the 
image  generation  problems  are  now  being  addressed  by  special  display 
management  chips  [33.  When  the  image  is  moving,  as  in  a  movie  sequence, 
there  is  'motion  blur',  which  may  drastically  decrease  aliasing  problems 
in  single  images.  Special  display  pipelines  will  handle  specific  types 
of  display  tasks.  Examples  are  clippers  (for  object  collision  detection) 
element  generators  (for  molecular  models)  and  texture  generators  (for 
terrain  models  and  architectural  tasks).  All  of  these  are  'naturals'  for 
VLSI  implementations.  However,  before  any  of  these  can  be  constructed, 
the  algorithms  must  be  fully  understood  and  thoroughly  tested.  The 
optimization  for  'acceptable'  image  quality  can  be  made;  this  step 
usually  involves  an  assessment  on  how  much  one  can  'get  away  with'  in 
trading  motion  for  clarity  of  static  images. 

Finally  the  computations  required  for  detailed  image  presentation  seem 
to  mandate  supercomputer  usage  [4],  Currently-used  algorithms  are  poorly 
suited  to  supercomputers'  vectorization  requirements.  A  critical  view 
needs  to  be  taken  at  re-casting  the  display  calculations  into  forms  more 
adaptable  to  vector  machines.  One  of  the  most  compute-intensive  tasks 
in  computer  -  graphics  today  is  the  rendering  of  good-looking  curved 
surfaces  by  means  of  detailed  calculations  of  the  color-shade  of  the 
surface  [8],  most  often  by  the  technique  of  ray-tracing.  A  brute-force 
application  of  the  commonly-used  techniques  of  ray-tracing  shows  a  very 
high  degree  of  non-uniform  calculation  steps.  Basically  each  ray  will 
interact  with  the  scene  differently:  after  all,  this  is  the  essence  of 
rendering  the  image  in  interesting  detail.  However,  a  re-thinking  of 
the  procedure  in  terms  of  a  particular  supercomputer  structure  has 
yielded  great  savings  in  computational  times  and  cost,  albeit  at  an 
expense  of  significant  increases  in  the  number  of  calculations  required 
and  the  amount  of  storage  employed  [53. 

The  need  to  generate  good  graphic  outputs  from  application  programs  has 
spurred  the  development  of  specialty  display  processors  and  display 
pipelines.  Virtually  every  major  IC  vendor  has  made  significant  efforts 
to  provide  devices  with  much  increased  performance.  Noteworthy  among 
these  is  AMD,  National  Semiconductor,  Weitek  and  NEC.  Specialty 
display  systems  based  on  novel  architectures  are  Silicon  Graphics's 
IRIS  and  tiling  engines  now  becoming  available.  Gouraud  shading  can  be 
included  in  tiling  devices  which  can  keep  up  with  real-time  6C  Hz 
displays,  usually  updated  through  a  double  buffering  scheme  at  30  Hz  or 
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less.  Another  significant  development  is  the  emergence  of  graphics 
controllers  directed  towards  video  RAMs.  Here  the  familiar  BitBlt  pixel 
operations  are  extended  to  blocks  of  pixels,  usually  referred  to  as 
PixBlt  operations.  As  yet,  almost  all  these  engines  operate  on  2-D  data, 
reflecting,  perhaps  the  current  graphics  standards  thinking  in  CORE  and 
Graphics  Kernel  Systems  (GKS).  With  the  current  efforts  in  PHIGS,  much 
more  three-dimensional  graphics  will  become  commonplace.  Modeling  and 
rendering  curved  surfaces  with  dedicated  hardware  often  run  into  numeric 
round-off  problems  and  require  inordinate  numeric  accuracy.  For  a  worst- 
case  study  of  8k  by  8k  images  of  quadratic  surfaces  the  control  of 
accuracy  can  lead  to  8  more  than  80  bits  in  the  internal  registers  [7.] 
Of  course  further  work  will  lead  to  a  re-thinking  of  this  process  also 
and  to  simpler  hardware.  Even  with  the  worst-case  it  is  possible  today 
to  put  a  few  of  the  quadratic-rendering  engines  on  a  single  VLSI  chip. 

With  the  general  trend  of  decreasing  hardware  costs  the  currently- 
touted  devices  will  spawn  standardized  architectures  for  the  display 
task.  In  turn  the  display  process  will  have  to  be  re-visited  to 
ascertain  how  it  can  be  best  mapped  to  high-performance  cheap  hardware. 
Moving  image  sequences  can  demonstrate  that  the  most  beautiful  images 
are  not  necessarily  the  most  useful  ones.  We  can  expect  progress  in 
devices  just  for  such  images. 
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DEVELOPING  A  STANDARD  TAXONOMY  OF 
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PEORIA,  IL  61625 

This  paper  presents  a  view  of  a  possible  future  direction  and  environment  for 
the  development  and  use  of  computer  software.  The  paper  presents  a  view  of  this 
software  engineering  discipline  and  activity  within  a  future  environment  which  is 
guided  and  influenced  by  consensus  standards.  In  particular,  a  view  of  a  draft 
standard  taxonomy  of  software  engineering  standards  is  presented  together  with  a 
view  of  the  utility  of  that  taxonomy  in  classifying  standardization  efforts  in 
software  engineering  environments. 

A'  taxonomy  can  be  described  as  a  scheme  which  divides  a  body  of  knowledge  and 
explains  the  relationships  among  the  pieces  and  is  used  for  classifying  and  under¬ 
standing  the  body  of  knowledge.  In  this  case  an  IEEE  Computer  Society  (IEEE-CS) 
project,  P1002,  was  authorized  by  the  Subcommittee  on  Software  Engineering  Standards 
(SESS)  of  the  Technical  Committee  on  Software  Engineering  (TCSE) .  A  key  goal  among 
the  goals  of  this  project,  was  to  apply  this  taxonomy  concept  in  order  to  develop 
a  standard  taxonomy  of  software  engineering  standards.  More  specifically  the  goals 
of  the  P1002  project  included: 

To  provide  a  comprehensive  scheme  for  classifying  software  engineering  stan¬ 
dards,  recommended  practices,  and  guides. 

To  provide  a  framework  for  identifying  the  need  for  new  software  engineering 
standards,  recommended  practices,  and  guides. 

To  present  a  comprehensive  scheme  for  analyzing  a  set  of  software  engineering 
standards,  recommended  practices,  and  guides  appropriate  for  a  given  industry,  com¬ 
pany,  project,  or  particular  work  assignment. 

To  present  a  framework  for  comparing  sets  of  software  engineering  standards, 
recommended  practices  and  guides  to  support  the  selection  of  the  most  useful  set 
for  a  particular  software  product. 

Particularly  the  goals  were  relative  to  SESS  standardization  projects  within  IEEE- 
CS,  but  the  goals  were  seen  to  have  potential  application  more  generally. 

The  P1002  project  produced  a  Draft  Standard  Taxonomy  for  Software  Engineering 
Standards  which  is  currently  being  ballotted  within  IEEE.  This  software  engineering 
standards  taxonomy  is  organized  into  three  parts  (1)  a  partition  of  standards, 

(2)  a  partition  of  software  engineering,  and  (3)  a  framework  which  relates  to  the 
two  partitions  to  achieve  a  partition  of  software  engineering  standards.  The  par¬ 
tition  of  standards  is  organized  by  type  of  standards,  namely  profession,  product, 
and  process  standards.  The  partition  of  software  engineering  is  organized  by  func¬ 
tion  and  lifecycle.  The  taxonomy  framework,  as  it  is  applied  to  achieve  the  coals 
of  the  project,  is  simply  a  matrix  view  in  which  the  two  partitions  are  brought 
face-to-face. 

The  thesis  of  this  paper  is  that  we  can  have  an  optimistic  view  of  the  future 
of  computer  software.  This  optimistic  view  is  based  on  ideas  that, 


1.  Standards  will  be  viewed  as  important  and  will  be  a  successful  "way  of  do¬ 
ing  software." 

2.  The  requirements  of  society  and  profession  will  be  met. 

3.  Consensus  standards  will  be  looked  upon  more  positively  in  the  market 
place. 

4.  Standards  will  be  perceived  as  an  aid  rather  than  an  impediment  to 
software  development. 

Additionally  it  is  based  on  the  idea  that  the  work  of  project  P1002  meets  its  goals 
in  that  it  provides  a  framework  within  which  software  engineering  standards  can  be 
developed  and  used  in  order  to  guide  the  construction  of  software  that  will  satisfy 
the  increasingly  complicated  and  complex  use  of  computers  in  present  and  future 
society. 

In  order  to  show  the  utility  of  the  work  of  the  P1002  project  the  presentation 
of  this  paper  shows  the  application  of  the  draft  standard  taxonomy  in  classifying 
already  existing  standards  and  potential  standards.  A  collection  of  public  stan¬ 
dards  available  through  trade  associations,  government  agencies,  national  societies 
other  than  IEEE  is  classified  by  means  of  the  taxonomy  framework.  In  a  separate 
application  the  IEEE  software  engineering  standards  which  have  been  already  de¬ 
veloped  and  have  passed  consensus  approval  as  well  as  those  potential  standards 
under  development  as  IEEE-CS  SESS  projects  are  classified  by  means  of  the  draft 
taxonomy.  With  this  classification  the  software  engineering  standards  activity 
within  IEEE  is  evaluated  using  the  taxonomy  framework. 

Appreciation  is  expressed  to  the  P1002  working  group  and  its  product  without 
which  this  paper  could  not  be  written  or  presented.  In  particular  appreciation 
is  expressed  to  P1002  working  group  chairman,  Leonard  Tripp,  for  his  work  in  the 
classification  examples  shown.  The  Draft  Standard  Taxonomy  for  Software  Engineer¬ 
ing  Standards  is  currently  in  ballot.  Copies  of  the  draft  taxonomy  are  available 
from  John  W.  Horch 

M.S.  178 

Cummings  Research  Park 

Teledyne  Brown  Engineering 

Huntsville,  AL  35807 

Members  and  Affiliate  Members  of  IEEE  are  encouraged  to  evaluate  and  vote  on  the 
draft  standard  that  is  the  work  of  the  P1002  project. 


SAGE 

THE  CLEMSON  UNIVERSITY  SYSTOLIC  ARRAY  GENERATING  ENGINE 


Roy  P.  Pargas  and  Keith  R.  Allen 
Department  of  Computer  Science 
Clemson  University,  Clemson,  SC  29634-1906 


ABSTRACT 

This  paper  describes  the  Clemson  University  Systolic  Array  Generating  Engine 
(SAGE).  SAGE  allows  a  user  to  key  in  a  nested  loop  algorithm  in  a  high-level  lan¬ 
guage  and,  for  certain  types  of  algorithms,  such  as  convolution,  matrix-vector  multi¬ 
plication,  matrix-matrix  multiplication,  polynomial  evaluation,  and  linear  recur¬ 
rences,  automatically  generates  systolic  array  solutions.  The  paper  concludes  by 
listing  future  plans  for  SAGE,  ana  some  thoughts  on  possible  directions  in  the  auto¬ 
matic  generation  of  systolic  arrays. 


INTRODUCTION 

Designing  systolic  arrays  to  implement  algorithms  has  traditionally  been  a  time- 
consuming  process.  The  problem  does  not  lie  In  algorithm  development;  the  algorithm 
being  implemented  at  the  time  a  systolic  array  is  designed  is  typically  well- 
understood.  Rather,  the  major  portion  of  systolic  array  development  time  is  devoted 
to  the  often-tedious  process  of  mapping  data  flow  in  the  algorithm  onto  a  regularly- 
interconnected  array  of  computational  cells.  To  be  successful,  the  designer  must 
accomplish  this  mapping  in  such  a  way  that  the  data  values  associated  with  the  compu¬ 
tation  enter  and  move  through  array  cells  in  a  very  orderly  (i.e.  "systolic")  fash¬ 
ion. 

Until  recently,  designing  systolic  arrays  has  been  approached  in  an  ad  hoc,  seat- 
of-the-pants  manner.  Research  in  the  past  four  years  has  shown,  however,  that  it  is 
possible  to  substantially  automate  the  process  of  designing  systolic  arrays,  for  cer¬ 
tain  classes  of  loop  algorithms  [1,2, 3, 4, 5, 6, 7, 8] .  Lam  and  Mostowf2]  describe  an 
experimental  prototype  program,  dubbed  Sys,  which  generates  systolic  designs  by 
applying  a  series  of  transformations  to  a  source  algorithm  which  has  been  annotated 
with  advice  as  to  how  to  allocate  operations  to  hardware  cells.  The  authors  argue 
that  their  transformational  method  has  advantages  for  handling  complex  algorithms, 
and  for  verifying  correctness  of  designs  obtained. 

In  a  paper  by  Li  and  Wah[3],  the  problem  of  designing  a  one-  or  two-dimensional 
systolic  array  for  computing  certain  recurrences  is  formulated  as  a  linear  program¬ 
ming  problem.  Characteristics  of  systolic  arrays  are  parametarized  by  data  flow 
velocities,  spatial  distribution,  and  periods  of  computation.  Constraint  equations 
are  then  functions  of  these  parameters,  and  linear  programming  is  employed  to  derive 
optimal  designs  for  various  objective  functions  of  computation  period  and  number  of 
processing  cells. 

A  technique  for  generating  systolic  designs  from  a  "space-time  representation" 
for  a  recurrence  algorithm  is  described  by  Steiglitz  and  Capello[7].  First  the  algo¬ 
rithm  is  imbedded  in  N-dimensional  space  in  a  graph  which  embodies  dataflow  dependen¬ 
cies  of  the  algorithm,  and  it  is  imagined  that  the  entire  algorithm  is  computed  in 
parallel,  at  many  points  in  space,  at  a  single  instant  of  time.  To  generate  various 
systolic  designs  for  the  algorithm,  the  space-time  representation  is  typically  oper¬ 
ated  upon  by  an  affine  transformation,  and  a  spatial  dimension  of  the  image  is 
''traded"  for  the  time  dimension. 

Hornick[8]  describes  a  class  of  transformations  on  systolic  networks  which  alter 
network  topology  while  preserving  timing  of  its  computations.  Such  transformations 
are  used  to  demonstrate  the  equivalence  of  existing  designs,  or  to  obtain  new  designs 
from  an  existing  design. 

The  basic  method  underlying  SAGE  is  based  on  work  first  done  by  Moldovan [4,5] . 
Allen  and  Pargas[l]  undertook  a  case  study  of  this  method  and  showed  how  to  use  the 
method  to  generate  various  designs  for  convolution  and  matrix  multiplication.  The 
same  method  was  extended  by  Moldovan  and  Fortes[6]  to  deal  with  the  problem  of  parti¬ 
tioning  a  large  nested  loop  algorithm  and  mapping  it  onto  a  small  systolic  array,  in 
such  a  way  that  accuracy  or  results  is  not  compromised,  and  communication  overhead 
between  subproblems  is  minimized.  They  developed  a  software  package  called  ADVIS  to 
aid  in  implementing  their  partitioning/mapping  process. 


BRIEF  OVERVIEW  OF  SAGE 

SAGE  (Version  1.0)  is  written  in  Pascal  and  runs  on  an  IBM-PC.  The  program 
allows  the  user  to  key  in  a  nested  loop  algorithm  in  a  high-level  language,  and  first 
analyzes  the  source  loop  algorithm  to  show  dependences  among  its  variables.  If  analy¬ 
sis  of  these  dependences  shows  that  a  systolic  solution  is  feasible,  the  user  may 
proceed  to  ask  for  one  or  more  solutions  to  be  generated. 


The  solution  space  of  systolic  designs  for  a  given  loop  algorithm  is  usually 
infinite,  so  SAGE  provides  the  user  with  various  means  for  guiding  the  solution 
search  so  as  to  produce  only  those  solutions  which  are  "reasonable",  or  "interest¬ 
ing".  A  typical  criterion  governing  the  user's  interest  may  be,  for  example,  the  way 
in  which  tne  systolic  computation  interfaces  with  the  larger  computation  in  which  it 
is  imbedded.  Once  a  solution  of  interest  is  generated,  SAGE  allows  the  user  to  see 
finer  details  of  the  systolic  design  by  single-stepping  through  the  computation  and 
displaying  cell  locations  of  variables  at  each  time  step. 

The  loop  algorithms  must  satisfy  certain  requirements: 

1.  They  must  be  nested  loops,  two  or  three  deep. 

2.  Only  the  innermost  loop  may  contain  an  assignment  statement. 

3.  Only  one  assignment  statement  is  allowed. 

Although  the  requirements  appear  to  be  somewhat  restrictive,  we  have  found  that 
several  different  types  of  loop  algorithms  satisfy  these  requirements.  These  include 
loop  algorithms  for  convolution,  matrix-matrix  multiplication,  matrix-vector  multi¬ 
plication,  polynomial  evaluation,  and  the  solution  of  recurrences.  For  each  algo¬ 
rithm,  multiple  solutions  have  been  generated. 

Briefly,  SAGE  searches  for  a  systolic  array  solution  by  performing  three  steps. 

1.  Determine  the  dependence  vector,  X,  for  each  of  the  variables  used  in  the  loop 
algorithm. 

2.  If  the  dependence  vector  X  is  constant,  find  a  transformation  matrix  T  such 
that  TX=V  and  the  first  component  of  the  image  vector  V  (i.e.,  V[l])  is  pos¬ 
itive.  The  reason  for  this  is  that  the  first  component  of  V  is  interpreted  as 
the  "time"  component  of  the  systolic  array.  The  restriction,  in  effect, 
requires  that  time  only  go  forward. 

3.  Build  the  systolic  array  from  information  gained  by  analyzing  the  vectors  V, 
and  by  applying  T  to  the  loop  indices  of  tne  source  algorithm. 

SAGE  is  entirely  menu-driven.  Its  main  menu  includes  the  following  user  options. 

1.  Collect  loop  information.  The  user  enters  information  such  as  loop  limits  and 
array  variable  names  and  indices,  as  given  by  the  source  loop  algorithm,.  As 
soon  as  the  information  is  complete,  SAGE  determines  whether  dependences  among 
the  variables  are  constant,  assuring  a  systolic  array  solution.*  If  a  non¬ 
constant  dependence  is  found,  the  user  may  still  be  able  to  develop  a  solu¬ 
tion,  although  one  is  not  guaranteed. 

2.  Display  loop  information.  Dependence  vectors  may  be  inspected  by  the  user. 

3.  Matrix  selection.  After  the  user  specifies  upper  and  lower  limits  on  the  val¬ 
ues  of  the  transformation  matrices,  SAGE  conducts  a  search.  Each  suitable 
matrix  found  is  displayed  together  with  information  on  the  systolic  array  that 
it  generates.  The  information  includes  the  number  of  systolic  cells  used  and 
the  number  of  time  steps  required  by  the  systolic  array  to  execute  the  algo¬ 
rithm.  The  user  has  tne  option  of  continuing  the  search,  forward  or  backward, 
or  of  selecting  a  specific  matrix  for  tracing. 

4.  Tracing  systolic  array  execution.  The  user  may  single-step  through  the  sys¬ 
tolic  array  solution,  either  forward  or  backward.  The  variables  are  displayed 
moving  from  cell  to  cell.  The  user  may  select  which  variables  to  display,  and 
which  to  temporarily  mask  out,  allowing  the  user  to  focus  on  the  movement  of 
only  selected  variables. 

The  interested  reader  is  referred  to  reports  by  Allen  and  Pargas[l,9]  in  which 
SAGE  is  described  in  greater  detail. 


CONCLUSION  AND  FUTURE  WORK 

Initial  experience  with  SAGE  indicates  that,  for  appropriate  source  algorithms, 
the  search  for  a  systolic  array  solution  becomes  almost  trivial.  The  user "no  longer 
faces  the  task  of  tracing  the  movement  of  the  data  through  the  ceils.  All  of  the 
tedious  work  is  done  by  the  program.  We  continue  to  experiment  with  SAGE.  In  par¬ 
ticular,  we  continue  to  search  for  other  problems  whose  algorithms  satisfy,  or  may  be 
modified  to  satisfy,  the  requirements  described  above. 

Work  also  continues  on  SAGE  itself.  Future  versions  will  incorporate  at  least 
the  following  features. 

1.  Multiple  assignment  statements  in  the  innermost  loop. 
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2.  Greater  user  control  over  characteristics  of  the  transformation  matrices 
returned  by  SAGE.  The  user  will  be  able  to  specify,  for  example,  that  SAGE 
return  only  systolic  solutions  in  which  a  specific  variable  remains  fixed  in 
the  cells,  or  that  a  variable  moves  right  to  left,  or  that  a  variable  moves  at 
a  specified  speed. 

3.  Equivalence  classes  of  solutions.  Hornick[4l  describes  a  method  for  determin¬ 
ing  a  family  of  systolic  solutions  to  a  problem  given  one  solution.  Future 
versions  of  SAGE  will  provide  the  user  with  the  option  of  searching  for  (or 
ignoring)  members  of  the  equivalence  class  of  a  known  and  specified  solution. 

Finally,  research  in  the  general  area  of  automatic  generation  of  systolic  arrays 
continues  at  Clemson  University.  As  experience  with  the  program  grows,  we  plan  to 
develop  a  design  methodology  for  the  construction  of  systolic  arrays.  The  goal  will 
be  a  general  and  systematic  method  for  the  generation  of  systolic  arrays  for  a  wide 
variety  of  loop  algorithms. 
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introduction 

The  data  flow  model  of  comoutation  nas  soawneo  diverse  approaches  to  tie 
strategies  associated  with  concurrent  processinu.  ir>  Particular,  tne 
distruouted  computer  svstein  (DCS)  is  an  important  class  of  architecture  as  a 
strategy  for  concurrent  processing.  Problems  associated  with  the  DC 8  concept 
include  the  decentra  I ized  control  and  tasK-to-resource  assianment.  Tne 
problems  stem  form  tne  forward  flow  of  both  data  and  control /ass ionment 
information  or  tokens.  Scheduling  or  the  data  flow  become  s  d  i  f  f  :  so  1  r  ■  ■"  :: 

as  a  l  gor  >  thm  inf  .rmaC  ion  is  decomposed  to  meet  s  Part  •  cu  >  ar  ar  :  sects’  e 
strategy.  Hence,  trie  imperfect  match  up  between  actual  arcni  teutur  e  da  :a  * i :w 
and  desirable  algorithm  data  flow  leads  to  difficulty  in  tne  analytic  of  ‘ns 
execution  behavior  of  the  decomposed  algorithm  in  the  concurrent  processing 
env i ronment . 

This  oaoer  proposes  a  modeling  strategy  which  addresses  5  the  data  flow 
consistency  between  algorithm  and  architecture  in  a  concur r enf  process • no 
environment.  Algorithms  of  interest  are  decomposed  into  primitive  tus*- ;  .  ■  ■_  • 

are  considered  to  be  comoutat  ional  I  v  intensive.  ns  is  mode  i  ;  based  sc.-n  a 
Petri  net  description  of  tne  the  data  management  for  eac.n  node  of  fe  cats  - 
graph.  Tne  node  Petri  net  model  is  actually  represented  a  iiarc  e_  v  uon 
( NMG ) .  Subsequent  interconection  of  these  NMGs  cor  r  esovrid :  rig  to  tne  a .  got  it 
data  flow  graph  leads  to  the  computation  marked  qr  aoh  (CMb)  of  tne  a  :cr  i  —  . 

fries  trie  -NMG  is  a  primitive  suucit  aran  or  the  c.M(j.  Two  imp-'*  tant  feuVu'*-*-  •  * 

mode  i  is  that  data  flow  is  preserved,  and  control  f  sow  is  i  den'  :  f  i  cc.  aro 
managed  w:th  separate  tc-eris.  further .  the  mode*  .-.•ads  rc  a  no  si...: 
ba  i  i  i  ues'ori  .f  a  data  driven  arch  i  t  eutu!~e.  i  t  ;  i.suroo  too  v 

af  i  tecr  are  is  it  at  i  c  ih  that  on!  /  one  u roc es  ;  oi  t  f  cnc.  ioha  •  ..."  .  .  . 

Scant  late  one  node  at  a  time,  of  theoretical  into'  sot  is  t:  at  t  od.- 
promote  better  under  standing  of  how  to  decompose  aioor  i  •  hms  to  "a- in  ;je 
processing  concur rnev.  Such  issues  as  thruout.  initial i bat i on .  oeadicc- .  e 
Can  be  interpreted  in  a  graph  context. 

None  Narked  graph  Model 

The  development  of  the  marked  graph  model  of  the  2-cort  comc-ut  jfi<y  ”•  i  • 
motivated  uv  trie  following  observations  and  assumot ions . 

1.  Functional  units  are  processors  with  local  memory  for  pi  c  gram  stor  age  jr>r: 
temporar v  input  and  output  data  containers. 

a.  Trie  data  memor  v  is  global  to  all  functional  units.  The  inputs  as  so. : ateo 

with  each  computational  graph  node  (process)  correspond  to  fixed  data 

containers  in  trie  global  data  memory. 

:  he  .c.iPutat  I  ona !  process  cannot  be  assigned  to  a  pi  oa;-  :  iu  :  ,-io  ..  .•  o  . 

i  functional  unit)  unless  a  functional  unit  is  ava  i  -able  and  all  incuts  t.r-e 

computation  are  aval 'able. 


:v] 


i 


r  -  a  . 


NM,j  1' 


1m  c  .  r  >t 
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>V' 
.  v  *. 


ii-f  inout-i  Duffer  full  (data  avail able) 

11- e  Input- 1  Duffer  empty  (status) 

12- f  InDut-2  buffer  full  (data  available.' 

12-e  inout-2  buffer  emotv  (status) 

DR  Data  reaav  in  functional  unit 

Nb  Process  not  busy 

PR  Process  or  computation  is  complete 

NMG  edoe  labels  (conc'd) 

Ul-e  Output-1  buffer  einotv  (status) 

'Jl-f  Output-!  buffet  fuM  (data  available) 

ud-e  Output -2  puffer  emotv  (status) 

u_-r  Output - 2  Duffer  re  :  (cat a  available) 

nodss  • 

A.  Transition  A  fires  when  I  1 . 1 2— f  .and  NB  tokens  are  available. 
Implicit  in  the  firing  rule  is  priority  assignment  ana  ava  i  I  ao  i  I  :  ’:•/ 
a  functional  resource  to  assign  to  the  particular  task. 

B.  Transition  B  fires  when  DR  token  is  available.  ihis  transit1 on 
corresponds  to  the  actual  computation  event. 

C.  Transition  C  fires  when  01 .02-e.  and  Pk  tokens  a* e 
ava i I ab 1 e . 

The  computational  marked  graph  LMG)  of  a  Particular  algorithm  is  ..■sms.-; 
of  tne  interconnection  of  NMGs  bv  joining  the  various  input  a’ cs  tc  tne 
correspond  i  na  output  arcs  of  predecessor  NMGs.  Data  few  a-  j  :-jffer-*  - 
conditions  are  represented  uv  directed  arcs  corresponding  to  the  dt  re  .teu  /.a 
flow  or  trie  original  computational  f  icw  araon.  However  .  ma:  rings  3  ate:: 
witn  status  and  control  are  descr  ioed  on  directed  arcs  in  ve  1  i  •  ect  '  .  :■ 
opposite  to  the  data  few.  -his  observation  leads  to  a  Con-.  :  stem:  >■  ; 

data  flow  and  rules  to  be  followed  uv  any  data  jr  i en  .afot:i  test  u**e  f  •  -c  e 
the  process.  The  algorithm  graph  and  ccinoutat iona !  mar ted  gr  aon  are  do  sc:  :b 
in  more  detail  and  are  illustrated  in  figures  2  and  j.  respective!/. 

data  Driven  Architecture 

Tne  architecture  must  allow  the  execution  of  the  lmg  whose  nodes  are 
enabled  bv  data  and  fired  bv  the  avai  labi  I  i  tv  of  a  functional  unif.  A 
resource  architecture  is  shown  in  Figure  4.  The  architecture  is  a  uj'.a-dr  ■ 
structure  which  is  a  natural  conseauence  of  meeting  the  reau i rements  o'  tne 
Petri-net  model. 

The  candidate  architecture  consists  of  four  differ  eric  units  ana  two 
interconnecting  buses.  The  hardware  units  consist  of  k-functionai  units  (Ft 
one  high  speed  data  memory  (DM),  one  (or  more)  I/O  units  ard  a  t  >en  manage'* 
(  M  .■  .  Control  snu  status  inrormatiun  are  passed  between  tne  1  cs  an.:  *r.  '  >A 
tne  token  bus.  Data  packets  are  passed  to  and  from  tne  var  iou.  us  and  the 
bv  wav  of  the  data  bus.  The  1/0  units  obey  all  protocol  rules  of  the  Tm  and 
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Summary 


This  paper  proposes  a  methodology  for  quick  and  not  dirty  interpreter  imple¬ 
mentations  in  high  level  languages.  It  is  based  upon  a  formal  and  operational 
description  of  stack  manipulations  using  the  mechanism  of  exceptions.  This  formal¬ 
ism  provides  a  powerful  tool  to  write  or  extend  an  interpreter  from  its  meta-circular 
definition  -  e.g.  Lisp  or  Prolog. 
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1.  Introduction 

The  exception  mechanism  is  present  in  many  programming  languages.  The 
conditions  of  PL/1  [PL/178]  has  been  the  first  but  not  the  least  Qu  [Liskov79]  and 
Ada  [Ansi83]  own  less  powerful  statements  of  that  type.  Recently  the  problem  of  its 
implementation  and  use  in  Modula-2  [Hopkin86]  and  Pascal  has  raised.  Common- 
LISP  [Steele  84]  does  not  (presently)  include  this  feature.  Note  that  the  three  last 
languages  are  powerfull  enough  to  allow  simulation  [Testard-Vaillant86], 

1.1.  Exceptional  situations 

We  call  exceptional  those  situations  in  which  the  interpreter  has  to  walk  down 
the  recursion  stack  -  i.e.  those  cases  when  the  interpretive  process  follows  no  longer 
a  strict  stack  discipline. 

There  are  basically  three  of  them  : 

1)  The  Garbage  Collector,  which  has  to  take  into  account  not  only  past  bind¬ 
ings  (shallow)  but  also  past  values  in  the  recursion  stack. 

2)  The  non-local  exit  mechanism,  which  is  clearly  exceptional.  Note  that  the 
whole  of  error  handling  can  be  brought  under  this  heading. 

3)  The  tail-recursive  situations  checking  [Greussay77,Saint-James84],  in 
dynamically-scoped  LISP  interpreters  which  requires  a  very  precise  examina¬ 
tion  of  the  stack  to  a  depth  unknown  in  advance. 

Now,  these  three  points  cannot  be  easily  treated  in  isolation.  Actually,  in 
current  practice  of  interpreter  writing,  they  pervade  virtually  the  whole  code  :  every 
single  module  has  to  beware  of  some  aspects  of  "the  gang  of  three". 

Our  gang  of  three  is  not  absolutely  necessary  but  A. I.  applications  often  spend 
lavishly  both  run-time  and  memory-space.  The  salient  point  of  the  third  part  is  that 
it  ensures  the  user  that  the  memory-space  devoted  to  the  recursion  stack  will  be  as 
small  as  possible.  This  encourages  the  use  of  recursion  which  becomes  an  efficient 
tool  for  programming  (from  a  run-time  point  of  view)  and  allows  extension  of  LISP 
toward  multi-processing. 


1.2.  Connections  between  exceptions  and  exceptional  situations 

Generally  speaking,  one  could  say  that  a  LISP  system  is  made  up  of  two  pans: 
The  normal  one,  which  is  usually  specified  by  some  metacircular  model 
[Riviferes84]  and  the  exceptional  one,  for  which  no  formal  description  is  given. 

Unfortunately,  the  normal  and  exceptional  parts  of  every  module  are  treated 
together,  in  an  all  but  clearly  organized  way.  As  a  consequence  of  this  state  of 
things,  some  unpleasant  features  are  commonly  accepted  such  as  the  destruction  of 
the  stack  when  a  thrown  value  doesn’t  find  its  catch  [Chailloux84]. 

All  the  sequencing  functions  (progn,  or,  and  ...)  have  to  take  precautions  i.e.  to 
free  the  top  of  the  stack  to  allow  the  tail-recursion  checking  [Saint- James84],  This 
tangled  way  of  writing  practically  prevents  the  verification  that  the  normal  part  is 
indeed  faithful  to  the  metacircular  definition  -  not  to  mention  inquiries  regarding  the 
exceptional  part 

We  argue  that  the  present  situation  is  largely  due  to  the  illusion  that  excep¬ 
tional  features  can  be  treated  in  an  off-hand  way,  by  immoderate  use  of  the  facili¬ 
ties  permitted  by  assembly  language  programming,  which  does  not  encourage  a  sys¬ 
tematic  point  of  view.  On  the  contrary,  we  adopt  the  attitude  that  interpreter  writ¬ 
ing,  as  the  rest  of  systems  programming,  should  be  performed  in  high  level 
languages.  Furthermore  we  claim  that  high  level  languages  must  be  really  used  as 
high  level  language  i.e.  we  will  entrust  them  the  management  of  the  recursion  stack. 
This  clearly  necessitates  the  formalization  of  stack  manipulations. 

We  propose  the  notion  of  an  exception  as  the  adequate  tool  for  our  stack- 
perusing  tasks. 

1-  The  Garbage  Collector  deals  with  the  gc  exception  raised  by  cons. 

2-  The  dynamic  non  local  exits  deal  with  the  escape  exception  raised  by 

throw. 

3-  The  tail-recursive  situations  checking  deals  with  the  tail  exception  raise  by 

apply. 
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1.3.  Overview  of  the  interpreter 

A  LISP  interpreter  is  a  set  of  procedures.  These  procedures  are  twofold:  the 
body  directly  translated  from  the  metacircular  model  and  the  handlers  which  have 
to  handle  the  exceptional  situations.  Though  the  specification  of  each  exceptional 
situation  will  afford: 

1-  the  name  of  the  specialized  exception, 

2-  the  name  of  the  procedure  which  initially  raise  the  exception  and 

3-  the  kinds  of  handlers  the  set  of  procedures  needs  to  handle  this  peculiar 

exceptional  situations. 


2.  Exceptions  behavior 


Let  us  draw  a  parallel  between  exception  raising  and  message  passing  in 
actor-based  languages:  the  procedure  which  needs  to  modify  the  stack  or  only  to 
get  information  from  it  will  send  a  message  to  the  closest  (calling)  procedure  pre¬ 
cisely  by  raising  an  exception.  This  mechanism  does  not  have  to  destroy  the  stack 
except  if  it  is  explicitly  asked.  If  the  receiving  procedure  is  not  able  to  understand 
(handle)  the  message  it  will  be  forwarded  downto  the  closest  one  in  the  recursion 
stack  which  will  act  as  the  proxy  in  ACT1  terminology.  On  the  contrary  if  the 
receiving  procedure  owns  a  method  for  this  message  (a  handler  for  this  exception)  it 
can  either  kill  the  current  message  (exception),  propagate  it  i.e.  transmit  it  to  its 


own  calling  procedure  or  return  it  to  the  sender. 


3.  Fixing  the  stack  problem 


3.1.  The  Garbage  Collector 

Let’s  take  a  widely  used  Garbage  Collection  algorithm  -  e.g.  the  Mark  and 
Sweep  one. 
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The  Garbage  Collector  is  called  by  cons.  It  raises  the  exception  called  gc.  There 
are  two  kinds  of  handlers:  those  which  belong  to  a  procedure  which  used  cons  cells 
for  its  owns  and  others.  The  first  kind  have  to  mark  the  said  cons  cells.  Both  of 
them  have  to  propagate  the  gc  exception. 


3.2.  Non-local  exit 


The  non-local  exit  is  the  work  of  throw.  It  raises  the  exception  called  escape. 
There  are  three  kinds  of  handlers:  thoses  which  only  propagate  the  exception,  the 
one  which  unbind  what  needs  to  be  before  to  propagate  the  exception  and  the  one 
which  kill  the  exception. 


Apply  or  more  precisely  the  specialized  part  of  it  which  deals  with  X- 
expressions  owns  a  second  kind  of  handler  and  catch-all  clearly  owns  a  third  kind 
of  handler. 


3.3.  Iterative  interpretation  of  tail-recursive  calls 

This  is  a  well-known  problem  both  in  lexical  binding  context  [Steele76a, 
Steele76b,  Steele77]  and  in  dynamic  binding  context  [Saint-James84], 


The  tail-recursive  situation  checking  is  the  work  of  apply.  It  raises  the  exception 
called  tail.  There  are  three  kinds  of  handlers:  those  which  return  the  exception 
meaning  the  current  situation  is  not  tail-recursive,  those  which  propagate  the  excep¬ 
tion  meaning  the  current  situation  may  be  tail-recursive  and  that  which  can  kill  the 
exception. 


The  handler  of  cons  belongs  to  the  first  kind  whereas  the  progn’s  one  checks 
if  the  expression  that  it  is  evaluating,  is  the  last  one  to  decide  to  return  or  to  pro¬ 
pagate.  The  part  of  apply  specialized  in  X-expressions  (app!y_expr)  kills  the  excep¬ 
tion  after  the  updating  of  the  environment  if  the  current  call  is  identical  to  its. 


Improvements 

We  will  show  that  the  crossed  tail-recursive  call  is  nothing  but  the  addition  of 
a  new  exception  with  the  description  of  the  raising  function  and  the  kinds  of 
handlers  needed. 


The  tail-recursive  situation  checking  is  the  work  of  the  apply_expr.  It  raises  the 
exception  called  crossed.  There  are  three  kinds  of  handlers  which  are  basically  the 
same  as  the  tail  exception  except  for  the  third  kind.  Rather  than  kill  the  exception 
when  a  crossed  tail-recursion  has  been  discovered  the  handler  has  to  update  the 
environment  and  then  to  return  it.  Coming  back  to  apply_expr  the  exception  will 
be  killed  after  the  updating  of  the  current  X-expression’s  body. 


More  complex  cases  such  as  left  associative  envelope  -  e.g.  +  and  *,  are 
decribed  in  the  same  way. 

4.  General  view  of  portable  (and  disposable)  interpreters 

The  exception  mechanism  allows  the  formal  and  operational  description  of 
stack  manipulations.  This  ends  up  to  quick  and  not  dirty  implementations  of  LISP 
interpreters  in  high  level  languages.  Following  P.  Greussay  lGreussay77],  we  can 
now  assert  that  the  quicker  way  to  implement  an  efficient  Prolog  interpreter  (for 
example)  is  not  to  write  it  in  LISP  but  to  deduce  it. from  a  LISP  interpreter  clearly 
specified. 
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ABSTRACT 

In  this  paper  a  rule-based  Lisp  dialect  translator  (LDT)  is 
presented.  The  advantage  of  using  a  rule-based  system  is  to  allow  the 
user  to  supply  his  own  rules  for  translation  and  thus  the  translator 
can  be  considered  a  general  purpose  dialect  converter.  The 
translation  being  used  for  development  is  Franz  to  Common  Lisp. 


1  INTRODUCTION 
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2  RULE  FORM 


The  rules  are  in  a  paramodulat ion  form,  as  an  example  of  the 
one-to-one  substitution: 

The  relationship:  EQUAL  (add  :  +) 

The  Franz  function:  (defun  example_l  ()  (add. 2  3)) 

The  Common  function:  (defun  example_l  ()  (+  2  3)) 

Paramodulat ion  occured  from  the  "add"  term  into  the  "+"  term.  The  LOT 
shall  restrict  the  form  of  the  equality  literal  such  chat  the  free 
term  is  always  to  be  the  left-hand  side  of  the  equality  literal  anc 
the  into  term  is  always  to  be  the  right-hand  side  of  the  equality 
literal . 


3  TRANSFORMATIONS 

3.1  One-to-one  Transformations 


The  rule  form  for  one-to-one  transformations  is  the  simplest  to  derive 
with  the  new_expression  directly  replacing  the  old_express ion .  See 
the  previous  example. 


3.2  One- to-many  Transformations 

The  rule  form  for  one-to-many  transformations  involves  matching. 
This  is  done  by  using  variables  within  the  expressions.  The  variables 
in  different  equality  literals  are  completely  independent,  even  if 
they  have  the  same  name.  The  first  part  of  the  rule  will  allow  for 
single-atom  instantiation  or  multiple-atom  instant i ■  t ion .  Single-atom 
instantiation  will  be  specified  by  using  the  "?"  symbol. 

For  example:  match  '(?  x)  'hello 

will  cause  x  to  be  instantiated  to  "hello". 

Multiple-atom  instantiation  will  be  specified  by  using  the  "+"  symbol. 

For  example:  match  '(  +  x)  '(hello  world) 

will  cause  :<  to  be  instantiated  to  '’hello  world). 

The  second  part  of  the  rule  will  nave  the  var iaeles  rep  la  red 
the  instantiated  value,  recardless  of  either  circle-  ?r  cul * ir 1 e-  it 
inst a n t  i  a t  ion. 
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As  an  example  of  one-to-many  substitution: 

The  relationship:  EQUAL  (nequal  (?  x)  (?  y)  :  not  (equal  x  y) ) 

( 

The  Franz  function:  (defun  example_2  () 

(cond  ((nequal  'a  ’b)  (print  'a)))) 

The  Common  function:  (defun  example_2  () 

(cond  ((not  (equal  '.a  'b))  (print  'a)))) 


3.3  Many-to-many  Transformations 

This  type  of  translation  can  be  seen  as  a  combination  of 
one-to-one  and  one-to-many  translation.  With  operations  being 
performed  on  the  variables  and  a  direct  substitution  between  the  twc 
instruct  ions . 

As  an  example  of  many-to-many  substitution: 

The  relationship  : 

EQUAL  (appendl  (?  x)  (?  y)  :  append  x  (list  y) ) 

The  Franz  function: 

(defun  example_3  ()  (appendl  '(a  be)  'a)) 

The  Common  function  : 

(defun  example_3  ()  (append  ’(a  be)  (list  'a))) 


3.4  Special  One-to-many  Transfo.  ual ion 

To  translate  superparentheses  from  Franz  to  Common  Lisp  a  special 
type  of  translation  is  needed.  In  Franz  Lisp  a  right  superparenthesir 
is  represented  by  "  ] "  and  can  close  off  as  many  open  left  parentheS'-r. 
as  needed  until  the  end  of  the  function  is  reached  or  until  an  open 
left  superparenthesis  is  encountered.  A  left  superparenthesis  is 
represented  by  "["  and  closes  one  right  superparenthesis. 
Superparentheses  are  not  allowed  in  Common  Lisp  so  a  transformation  is 
needed  to  change  superparenthesis  to  parenthesis.  At  present  this 
substitution  takes  place  in  a  function  and  is  not  implemented  with  a 
rule. 

The  Franz  function:  (defun  example  4  ;  ' 

(  cond  7 1  r.u  11  >  '  >  1  pr  i no  ” he  1  lo ’’  ]  ' 

Trie  Common  function:  defun  example  4 
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Semi-Applicative  Programming 
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Novel  parallel  machines  are  being  designed  and  built  Amid  the  loud  applause  for  the 
ingenuity  of  the  ideas  and  implications  of  their  success,  we  also  hear  the  remark  "But 
how  are  you  going  to  program  such  a  beast0",  thereby  implying  that  the  software 
problem  remains  once  the  hardware  is  designed  Similarly,  several  attempts  are  being 
made  at  parallel  programming  language  design,  one  hears  the  remark  "What  we  really 
need  are  ways  of  thinking  and  problem  solving  that  incorporate  parallelism"  This 
reaction  stems  from  the  view  that  a  programming  language  is  merely  a  notation  and 
new  developments  in  programming  methodology  are  essential  for  the  proper  use  of  the 
next  generation  of  computers 

We  aim  at  developing  a  programming  language  and  programming  methodology  that  allow 
effective  use  of  medium-scale,  medium-grain  parallelism  support  correct  program 
development,  and  allow  effective,  error-free  control  of  program  behavior  through  a 
variety  of  means 

As  an  initial  set  of  problems  to  study  in  the  project,  we  are  investigating  search 
algorithms  (alpha-beta,  branch  and  bound,  backtrack),  constraint  propagation  and 
marker  propagation  algorithms,  constraint  satisfaction  and  relaxation  algorithms  and 
parsing  algorithms 


& 


The  full  length  version  of  this  paper  [9]  presents  the  results  of  study  on  parsing 
algorithms  for  context-free  grammars  We  start  with  the  specification  of  a  recognizer 
for  context-free  grammars  in  Chomskv  -  Normal  -  Form  (CNF)  and  derive  bv 
transformations  a  variety  of  different  purely  applicative  parallel  recognition 
algorithms  We  then  introduce  program  annotations  and  display  semi-applicative 
algorithms  In  one  algorithm  we  indicate  how  adaptive  scheduling  can  control  the 
behavior  exhibited  by  the  algorithm  Thus  we  hope  the  reader  observes  the  use  of 
transformations,  annotations  and  scheduling  as  means  of  controlling  program  behavior 

1  Resource  control  in  applicative  programs 

We  are  experimenting  with  a  programming  language.  SALT,  that  has  an  implicitly 
parallel  applicative  language  as  the  base  language,  and  then  we  introduce  constructs 
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that  inhibit  parallelism  The  base  language.  .'ALT.  uses  pure  LIoP  [7]  as  a  foundation 
and  blends  in  interesting  features  of  Prolog  [ 6 j  and  FP  [  1  ]  We  introduce  several 
techniques  of  controlling  the  behavior  of  functional  programs  without  changing  their 
meaning  or  functionality  (1)  program  annotation  with  constructs  that  have  benign 
side -effects,  (n)  program  transformation  and  <111 )  adaptive  scheduling  This 
combination  yields  us  a  semi  -  applicative  programming  language  and  an  interesting 
programming  methodology  Because  the  applicative  language  has  simple  clear  parallel 
semantics,  it  facilitates  doing  program  transformations  Because  the  expression  of 
concurrency  is  implicit,  this  provides  freedom  in  scheduling  tasks  and  thus  facilitates 
adaptive  scheduling.  We  believe  that  this  approach  combining  implicit  parallelism  and 
explicit  control  will  reduce  the  risk  of  introducing  bugs  in  the  process  of  tuning 
programs  for  performance,  as  well  as  providing  an  antidote  to  Amdahl  s  law 


1.1  Program  annotations 


We  have  developed  an  initial  design  for  an  annotation  language  called  PEPPER 
Annotations  in  PEPPER  include  precedence  control,  function  call/ result  caching  [4.  5], 
and  lazy  (or  demand-driven)  evaluation  Our  initial  design  document  describes 
PEPPER  annotations  and  gives  several  examples  of  their  use  A  key  feature  of  PEPPER 
annotations  is  that  their  introduction  will  not  alter  the  functional  value  of  a  program 
and  will  affect  only  its  run-time  behavior 


1.2  Program  transformation 

Program  transformation  [2,  3,  8]  of  the  source-level  algorithms  written  in  SALT  is 
essential  There  are  two  quite  different  reasons  for  considering  program 
transformation  Firstly,  given  that  PEPPER  annotations  are  to  be  added  to  the  text  of 
a  program,  it  is  evident  that  textually  different  but  functionally  equivalent  programs 
offer  different  opportunities  for  adding  annotations,  and  hence  provide  different 
opportunities  for  achieving  different  behaviors  A  programmer  can  make  full  use  of 
annotations  only  in  conjunction  with  the  ability  to  do  program  transformations 
Secondly  different  programs  yield  different  data  -  dependency  orderings  thus  allowing 
differing  amounts  of  concurrency  Thus  program  transformation  is  a  technique  for 
controlling  the  available  concurrency  in  a  program 


13  Adaptive  scheduling 


Adaptive  scheduling  can  yield  improved  behavior  with  repeated  runs  of  the  same 
program  Adaptation  requires  monitoring  and  measuring  run-time  characteristics  of 
SALT-*- PEPPER  programs  in  order  to  make  scheduling  decisions  dynamicallv  In  working 
with  virtual  parallelism,  the  program  only  expresses  'available  concurrency  without 
mandating  what  in  fact  will,  execute  concurrently  with  what  The  scheduler  converts 
the  available  concurrency  into  physical  concurrency  making  its  choices  for  running 
tasks  by  considering  available  resources,  resource  requirements  and  other  attributes 
of  tasks  The  scheduler  typically  is  a  heuristic-  procedure  and  does  not  vield  optimal 
behavior  in  all  cases  Thus,  another  avenue  open  to  the  programmer  is  to  tune  some 
parameters  of  the  scheduler  to  alter  the  behavior  of  the  program  We  feel  this  method 
of  control  is  indirect'  and  while  that  is  necessarv  it  1.'  not  likeiv  to  be  sufficient  It 
mev  be  useful  to  construct,  an  adaptive  scheduler  that  tunes  its  parameters  ba:  ed  on 
empirical  measurements 
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2  Research  Results:  Recognition  algorithms  for  CNF  grammars 

The  initial  specification  is  a  standard  mathematical  definition  for  what  a  recognition 
algorithm  should  do.  In  several  steps  of  transformation,  we  derive  a  variety  of 
parallel  recognizers  (see  Figure  1).  In  the  literature  there  are  two  well-known 
parsing  algorithms.  Cocke.  Younger  and  Kasami.  independently  reported  on  the 
development  of  a  bottom-up  parser  that  works  on  a  restricted  form  of  grammars 
called  the  Chomsky-Normal-Form  Earley  published  an  algorithm  that  works  with 
unrestricted  context-free  grammars  and  parses  an  input  string  in  a  top-down  fashion 
using  left-context  to  limit  search.  Most  of  us.  including  the  author,  have  admired  the 
cleverness  of  such  algorithms.  One  of  the  things  we  do  in  the  full  paper  is  to 
demonstrate  that  these  algorithms  have  real  close  affinities  to  each  other  and  that 
they  can  be  arrived  at  systematically  Similar  derivations  of  various  SORT  algorithms 
have  been  previously  presented  in  the  literature.  All  of  these  have  been  successful 
only  at  deriving  a  few  of  the  already  known  sort  algorithms,  none  of  these  derive  any 
new  SORT  algorithm  Can  one  readily  adopt  these  transformational  techniques  to 
derive  new  algorithms?  Our  attempt  hopefully  convinces  the  reader  that  new  parallel 
algorithms  can  be  derived  systematically. 
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Experimenting  with  parallel  programming  in  logic 
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We  present  an  extension  to  the  logic  programming  language  Prolog, 
SISPRO  (Super  'is*  Prolog).  This  extension  is  on  a  very  small  scale  but  is 
typical  of  other  efforts  in  this  area  to  extend  the  capability  of  purely 
logic  programs  to  handle  also  functional  constructs.  The  motivation  for 
efforts  in  this  direction  is  to  remove  obstacles  in  bringing  about  the 
further  development  of  a  very  promising  software  tool. 

The  particular  extension  to  Prolog  to  be  discussed  is  realized  by  the 
introduction  of  a  rewrite  rule  permitting  the  specification  of  functional 
terms  to  reside  side  by  side  with  logical  terms.  A  new  local  variable 
assignment  operator  is  also  introduced  to  take  the  place  of  the  conventional 
'is'  operator.  In  addition  to  assignment  to  a  local  variable,  results  of 
simple  arithmetic,  SIS,  the  new  operator,  can  also  bind  to  the  local 
variable  results  of  function  evaluation.  In  fact  there  is  no  restriction  on 
the  bindings  capabilities  of  SIS. 

The  implementation  of  SISPRO  was  achieved  by  the  creation  of  a  small 
interpreter  written  in  regular  Prolog.  This  is  but  a  small  example  of  the 
use  of  Prolog  as  a  language  for  the  creation  of  runnable  specification  for 
software.  The  SISPRO  interpreter  will  evaluate  functions  defined 
recursively  and  at  the  same  time  pass  regular  Prolog  terms  to  the 
Prolog  interpreter  for  conventional  evaluation. 

append( [ ] ,L,L) . 

append ( [ X | LI ] , L2 , [ X | L3 ] ) : —  append( [LI ,L2,L3] ) . 

Example  la.  Standard  Prolog  does  not  require  any  specific  variable  binding 
order,  thus  achieving  "multi  directionality" 


append([],B)  =>  B. 

append( [X|Y] ,B)  =>  ([X|  append(Y,B) ]) . 

Example  lb.  SISPRO  requires  that  the  input  variables  be  bound  at  the 
beginning  of  execution,  thus  achieving  "uni  directionality" 

As  technology  evolves,  there  is  greater  hope  to  free  the  software 
developer  from  the  Von  Neumann  model  of  computation.  Conventional  or 
imperative  programming  languages  are  intimately  tied  to  the  Von 
Neumann  model  of  computation.  On  the  other  hand,  declarative  languages 
such  as  Lisp  and  Prolog  have  their  origin  in  mathematical  formalism  which 
have  developed  totally  seperate  from  any  consideration  of 
implementation  by  computation  devices.  This  freedom  resulted  in  formalisms 
which  achieved  greater  experssive  power,  possibility  for  formal 
manipulation  and  ease  of  parallel  evaluation.  These  attributes  might  very 
well  influence  the  direction  of  future  computer  implementations. 
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Declarative  languages  have  taken  two  forms:  functional  with  lambda 
calculus  and  recursion  equations  systems,  and  logic  programming  based 
on  procedural  interpretation  of  the  first  order  predicate  calculus. 

Logic  programs  are  very  amenable  to  parallel  execution.  We  can  view 
the  computation  of  a  logic  program  as  a  process  of  constructing  a  proof 
to  the  goal  statement  based  on  the  axioms  available  to  the  program.  The 
search  space  consisting  of  these  axioms  is  often  described  as  an  And/Or 
tree,  where  an  And  node  correspond  to  a  conjunctive  goal  and  an  Or  node 
correspond  to  the  different  ways  a  unit  goal  can  be  reduced. 

The  Prolog  interpreter  selects  the  first  listed  Or  term  and  then 
proceeds  to  travers  the  And  nodes  from  left  to  right.  Or  parallelism 
occurs  when  several  Or  paths  are  tried  at  once.  And  parallelism  occurs  when 
all  the  conjunctive  pathes  are  tried  at  once.  Goals  in  the  conjunctive  nodes 
may  have  variables  in  common  and  therefore  are  not  independent  of  each 
other.  Thus  the  coordination  of  the  And  parallel  execution  is  necessary. 

To  date,  the  fundamental  approach  to  the  parallel  evaluation  of  Prolog 
clauses  follows  the  pattern  of  constructing  experssions  of  the 
following  form: 

A:-  G1,G2, . ...Gm|  B1,B2, . Bn.  m,n  >=  0 

G's  and  B's  are  atomic  formulas  (unit  goals)  and  A  is  a  clause  head.  The 
G's  are  the  guards  and  the  B's  are  the  body.  A  variable  within  an 
atomic  formula  in  the  body  of  a  clause  can  assume  an  independent  role  when 
evaluated  during  parallel  execution,  or  it  can  assume  a  dependent  role, 
resulting  in  susppension  of  execution  for  the  term  until  a  value  is 
"communicated"  to  it  from  a  term  with  the  same  variable  in  an  independent 
role. 

The  essential  difference  between  functional  and  logic  programs 
is  mostly  considered  in  their  "input  output  directionality". 
Functional  languages  are  directional  in  that  their  programs  make  an 
explicit  commitment  about  which  quantities  are  inputs  and  which  are  outputs. 
Logic  programs  do  not  make  such  commitments.  Non  directionality,  although 
endowing  logic  programs  with  greater  expressibility,  results  in  problems  of 
efficiency  of  execution  and  problems  in  their  parallel  implementations. 

The  "input  output  directionality"  disparity  between  functional 
and  logic  programs  is  the  direct  consequence  of  the  calling  mechanisms 
employed  by  the  two  languages.  Functional  languages  employ  'pattern 
matching'  while  logic  programs  employ  'unification'.  Unification  subsumes 
pattern  matching,  herein  lies  the  capabilities  of  logic  programs  not 
possessed  by  functional  programs.  In  unification,  both  the  goal  and  the 
head  of  the  clause  being  resolved  against  are  allowed  to  contain 
variables,  and  a  successful  unification  produces  two  seperate  substitutions, 
Input  substitution  and  output  substitution.  In  pattern  matching,  only  the 
head  of  the  clause  or  the  goal  are  allowed  to  contain  variables,  thus 
input  or  output  directionality  is  implicitly  specified  permitting  only 
input  or  only  output  substitution  at  runtime. 


The  objective  strived  for  in  the  design  of  SISPRO  was  to  introduce 
directionality  via  the  specification  of  functional  terms,  for  the 
purpose  of  execution  efficiency.  Evaluation  in  such  cases  is  purely 
deterministic  and  backtracking  is  not  needed  by  the  interpreter  for 
undoing  previous  bindings.  The  dual  directionality  of  the  logic  component  of 
the  language  remains  unaffected  and  the  programmer  is  free  to  switch  back 
and  forth  between  functional  clauses  and  relational  clauses. 

In  most  parallel  implementation  of  logic  programs,  in  order  to  avoid 
ambiguity  during  run  time  assignment  process,  modality  has  to  be  specified 
and  variables  that  appear  in  more  than  one  formula  have  to  be  tagged. 
Modality  refers  to  the  specification  of  the  mode  of  evaluation  for  the 
clause.  Tagging  refers  to  the  identification  of  variables  in  terms  of  their 
role  in  the  run  time  binding  priorities.  Such  tagging  goes  by  the  names  of: 
Read  only  variables,  lazy  or  eager  evaluation  variables  and  consumers  and 
producers  variables. 

In  SISPRO  delayed  (lazy)  evaluation  of  the  second  term  of  a  rewrite 
rule,  often  occurs.  When  a  variable  is  unified  with  a  terra  of  the  form  => 
f(foo),  rewriting  must  not  always  be  completed.  Often,  it  is  sufficient 
just  to  do  normal  Prolog  unification  to  the  term  and  proceed  with 
the  rest  of  the  computation  for  a  while.  This  is  a  form  of  delayed 
evaluation  providing  automatic  moding  in  a  parallel  evaluation  environment. 


The  runtime  behavior  of  functional  programs  is  much  simpler  to  control 
than  that  of  logic  programs,  particularly  in  parallel  execution  contex.  By 
reducing,  whenenver  possible,  the  logic  program  to  its  functional 
equivalence,  we  achieve  greater  runtime  efficiency,  greater  clarity  in 
program  statements  and  greater  control  over  program  execution. 


qsort( [ ] )  =>  []. 

qsort([X|U])  =>  append(qsort(Ul ) , [X | qsort(U2) ] ) :- 

partition(X,U,Ul ,U2) . 

partition^,  [],[],[]). 


partition(X, [Y 
partition(X, [ Y 


U] , [Y|U1] ,U2) Y  =<  X,  partition(X,U,Ul ,U2) . 
U],U1,[Y|U2]):—  X  <  Y  ,  partition(X,U,Ul ,U2) . 


Example  2.  Definition  of  the  sorting  algorithm  Quicksort  as  an  hybrid  of 
functional  and  logical  expressions,  resulting  in  a  much  more 
efficient  parallel  evaluation. 


Concluding  remarks: 

The  use  of  logic  programming  as  runnable  specifications  of  software 
systems  is  growing.  In  particular,  its  use  in  specifying  parallel  processes. 
For  efforts  in  this  direction  to  have  an  impact,  logic  programming  needs  to 
be  extended  in  a  number  of  directions.  This  note  examined  via  an 
implementation  example,  one  such  direction  for  extension. 
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Summary 

This  project  is  investigating  parallel  logic  programming  architectures  for  AI  appli¬ 
cations.  Logic  programming  provides  a  convenient  paradigm  for  expressing  parallelism 
because  it  allows  the  separation  of  logic  and  control  in  an  algorithm.  Potential  parallel¬ 
ism  in  a  logic  program  can  be  extremely  large.  A  given  sequence  of  subgoals  Bl,...,Bn 
can  be  solved  by  solving  each  subgoal  Bi  in  parallel  (AND-parallelism).  Solution  of  a 
subgoal  Bi  can  be  attempted  by  trying  all  possible  rules  in  parallel  whose  heads  match  Bi 
(OR-parallelism).  But  this  scheme  generates  extremely  large  number  of  concurrent 
activities,  and  can  only  be  implemented  on  an  idealized  architecture  having  an  unlimited 
number  of  processors  and  zero  communication  overhead.  Furthermore,  much  of  the 
work  done  by  this  unconstrained  search  may  be  redundant.  For  example,  if  two  subgoals, 
BI  and  B2,  share  variables,  then  many  incompatible  solutions  of  BI  and  B2  may  be  gen¬ 
erated.  If  only  one  solution  of  the  original  goal  is  needed,  then  many  solutions  generated 
for  a  subgoal  can  be  redundant.  Since  a  practical  parallel  processor  can  only  have  finite 
amount  of  resources,  we  need  to  develop  a  strategy  which  creates  parallel  activities  in  a 
judicious  manner. 

In  our  research  we  are  investigating  various  (constrained)  ways  of  using  parallelism 
in  logic  programs  and  architectures  for  implementing  them.  We  have  developed  a  paral¬ 
lel  execution  model  for  logic  programming.  The  model  is  process  based  and  uncovers 
both  AND  and  OR  parallelisms,  which  makes  it  particularly  suited  for  AI  applications. 
We  have  developed  a  strategy  which  controls  the  exploitation  of  AND-parallelism  with 
the  help  of  domain-specific  information  provided  by  the  programmer,  and  has  minimal 
run  time  overhead  [4].  When  subgoals  share  variables,  failure  of  a  subgoal  may  mean 
that  some  other  subgoal  needs  to  backtrack  and  generate  a  new  solution.  This  can  be 
done  easily  (by  backtracking  to  the  subgoal  to  the  left  of  the  failed  subgoal)  if  the 
subgoals  are  solved  strictly  sequentially.  But  when  AND-parallelism  is  being  exploited, 
the  backtracking  decision  becomes  nontrivial.  We  have  developed  an  algorithm  for 
backtracking  which  is  more  efficient  than  the  ones  previously  proposed  [2],  [1],  and 
requires  minimal  run  time  overhead  [6].  Another  feature  of  o.ur  execution  model  is  that  it 
can  be  implemented  in  a  fully  distributed  manner,  i.e.,  the  AND  process  solving  a 
sequence  of  subgoals  does  not  have  to  work  as  a  message  center  for  "child"  OR  processes 
[51. 

Due  to  limited  resources,  we  have  to  choose  the  relative  degrees  of  permitted  AND 
and  OR  parallelisms.  The  relative  utility  of  these  two  parallelisms  is  greatly  dependent 
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upon  the  specific  logic  program,  as  in  some  cases  (e.g.,  for  deterministic  logic  programs) 
OR-parallelism  would  only  create  redundant  work,  whereas  in  others  (requiring  lot  of 
backtracking)  OR-parallelism  would  be  very  useful.  We  plan  to  run  simulations  to  study 
the  trade  off  between  AND  and  OR  parallelisms  for  different  kinds  of  logic  programs.  A 
study  to  investigate  the  effectiveness  of  various  strategies  for  exploiting  OR  parallelism 
in  logic  programs  is  already  underway  [3]. 

Allocation  of  processors  to  processes  in  the  physical  architecture  is  another  impor¬ 
tant  issue  needed  to  be  solved.  On  one  hand,  since  processes  have  to  communicate,  we 
do  not  want  to  allocate  related  processes  far  away  in  the  network;  on  the  other  hand, 
since  speedup  is  the  main  concern  of  parallel  execution,  we  would  like  to  distribute 
processes  to  every  available  processor.  The  trade-off  here  depends  on  the  connectivity  of 
underlying  architecture.  Different  strategies  for  sharing  data  structures  among  processes 
can  also  be  used  to  control  the  amount  of  message  traffic  generated.  We  are  currently 
looking  for  some  potential  solutions  to  these  problems  and  plan  to  test  their  usefulness 
via  simulations. 
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Introduction 

The  research  reported  here  is  part  of  an  overall  research  project  into  real-time 
intelligent  systems.  The  goal  of  this  research  is  to  learn  how  to  build  the  next  genera¬ 
tion  of  systems  which  have  extensive  embedded  decision  making  capability.  The  trend 
is  to  embed  ever  increasing  amounts  of  computational  power  into  systems  and  to 
transfer  responsibility  for  decision  making  from  the  human  operator  to  the  computers 
embedded  within  the  system.  The  goal  is  to  be  able  to  build  highly  complex  systems 
that  are  reliable  and  fault  tolerant  and  yet  can  be  operated  by  people  who  do  not  need 
a  detailed  knowledge  of  how  the  system  works.  Specifically  the  research  is  concerned 
with  the  development  of  hardware  and  software  structures  for  real-time  resource- 
limited  decision  making  by  computer. 

Research  Models 

Real-time  intelligent  systems  are  being  investigated  by  building  several  model 
systems  and  looking  for  common  themes.  From  these  themes  a  software  architecture  is 
being  developed  that  can  support  real-time  intelligent  decision  making  and  a  hardware 
architecture  is  being  developed  to  efficiently  execute  the  required  software  paradigms. 

The  decision  making  systems  currently  under  development  include: 

(1)  An  intelligent  robot  which  attempts  to  dynamically  optimize  the  order  in  which  it 
places  piece  parts  from  a  conveyer  belt  into  a  set  of  bins.  As  the  arm  has  limited 
motion  velocities,  the  order  in  which  the  pieces  are  placed  in  the  bins  is  critically 
important  if  a  piece  is  not  to  pass  out  of  the  reach  of  the  arm  before  it  is  picked 
up.  We  are  using  this  problem  to  investigate  the  use  of  incremental  evidence  and 
approximate  decision  making  in  real-time  computation  resource  limited  situations. 

(2)  A  mobile  robot  which  attempts  to  navigate  its  way  to  a  goal  point  some  distance 
away  in  a  maze.  The  robot  does  not  have  any  a  priori  knowledge  about  the  maze 
and  must  learn  about  the  maze  using  a  rangefinding  sensor  which  can  be  pointed 
in  any  direction.  All  decisions  must  take  account  of  the  time  to  compute  the  deci¬ 
sion  and  of  the  robot’s  movement  inertia,  e.g.  the  decision  to  stop  must  be  made 
in  time  to  avoid  hitting  a  wall. 


(3)  A  fault  evaluator  and  a  process  scheduler  for  a  multi-processor  computer  system. 
This  program,  which  must  run  very  fast  and  be  aware  of  its  own  time-line,  will 
evaluate  fault  reports  from  different  processors  and  decide  dynamically  which 
processes  should  run  on  which  processors.  This  is  similar  in  nature  to  (1)  above 
but  has  much  more  severe  computational  constraints  and  a  much  more  dynamic 
environment  It  also  addresses  the  problems  of  distributed  decision  making  and 
the  difficulties  caused  by  false  reports  and  erroneous  decisions. 


Activation  Frame 

As  a  result  of  these  investigations,  a  real-time  decision  making  technique1,2  has 
been  developed  that  performs  planning  by  using  a  number  of  concurrent  processes  that 
execute  in  parallel.  Some  of  the  processes  are  able  to  quickly,  but  not  accurately, 
evaluate  alternative  decisions  and  others  take  longer  to  provide  a  more  in-depth  evalua¬ 
tion.  A  decision  is  made  at  a  required  point  in  time  by  looking  at  the  weight  of  the 
evidence  generated  by  the  different  planning  procedures  up  to  that  time.  This  tech¬ 
nique  has  been  combined  with  prior  work  in  neural  network  modeling,  real-time 
operating  systems,  and  frame  based  AI  systems  to  develop  the  Activation  Framework 
concept^  for  programming  real-time  artificial  intelligence  applications. 

This  technique  is  based  on  the  use  of  Communicating  Expert  Objects  (CEOs). 
Each  object  contains  a  local  domain  of  expertise  embodied  in  hypotheses  and  pro¬ 
cedural  knowledge.  These  objects  communicate  by  means  of  messages  and  can  execute 
in  parallel  on  different  processors.  When  a  message  arrives  at  a  CEO,  it  triggers  the 
execution  of  a  section  of  code  which  may  create  new  hypotheses,  modify  the  evidence 
for  existing  hypotheses,  and/or  delete  hypotheses.  In  addition  the  section  of  code  may 
send  messages  to  other  expert  modules  to  inform  them  of  some  deduction  that  has 
been  made  or  to  request  information  about  certain  hypotheses. 


In  application,  separate  CEOs  could  be  used  for  each  of  the  methods  used  to 
evaluate  the  evidence  for  different  courses  of  action.  Each  of  these  CEOs  would  be 
sent  a  message  containing  information  about  the  decisions  to  be  evaluated  and  the 
world  state  in  which  they  would  be  applied.  The  CEOs  will  then  evaluate  the  deci¬ 
sions  and  send  to  another  CEO  the  evidence  for  or  against  the  proposed  alternate 
courses  of  action.  This  CEO  will  then  make  a  decision  based  on  the  evidence  received 
from  the  other  CEOs. 


The  power  of  this  method  lies  in  the  concept  that  some  CEOs  will  contain  quick 
rules  of  thumb  that  are  able  to  execute  quickly  and  generate  early  evidence.  Other 
CEOs  may  contain  more  exact  evaluation  procedures  that  take  much  longer  to  execute 
and  these  may  involve  the  interaction  with  other  CEOs  that  have  specific  domain 
expertise  needed  to  support  the  evaluation.  As  the  decision  making  CEO  receives  evi¬ 
dence  it  can  evaluate  whether  it  has  enough  evidence  for  a  decision,  or  it  can  choose 
the  best  decision  based  on  its  knowledge  of  the  available  time  before  a  decision  must 
be  made.  In  order  to  realize  the  potential  of  this  method,  an  architecture  is  needed 
which  is  capable  of  supporting  the  execution  of  many  CEOs  in  parallel.  This  lead  to 
the  evolution  of  the  concepts  presented  here. 

Any  CEO  can  communicate  with  any  other  and  does  so  by  means  of  a  message 
addressed  to  the  destination  CEO.  These  messages  have  a  message  type  that  is  used  to 
designate  which  section  of  the  procedural  code  within  the  CEO  will  be  triggered  by 
the  message.  Messages  also  have  an  activation  level  (akin  to  a  priority  level)  that  is  a 
product  of  the  importance  that  the  CEO  places  on  the  message  and  the  decision  mak¬ 
ing  importance  of  the  code  within  the  CEO.  To  realize  this  software  architecture,  a 

hardware  architecture  was  selected  in  which  each  CEO  would  be  executed  within  its 
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own  processor  so  that  each  CEO  could  execute  truly  in  parallel.  In  making  this  choice 
it  was  realized  that  that  at  any  time  many  of  the  CEOs  and  therefore  processors  would 
be  idle.  It  was  postulated  that  such  a  system  would  still  be  a  very  cost  effective  deci¬ 
sion  making  processor  despite  this  inefficiency  because: 

(a)  This  inefficiency  would  be  more  than  compensated  for  by  the  efficiencies  gained 
by  not  having  any  multi-processing  scheduling  overhead  in  each  of  the  processing 
elements, 

(b)  This  design  would  have  the  potential  for  being  realized  at  a  low  cost  per  process¬ 
ing  element  through  the  use  of  VLSI  technologies. 

A  major  thrust  of  our  work  is  to  investigate  such  an  architecture. 

Activation  Cells 

Activation  Cell  Processor  (ACP)  was  chosen  for  the  processing  element  that 
would  contain  each  CEO.  It  was  decided  that,  as  there  would  be  many  A  CPs  in  a  sys¬ 
tem  (our  thinking  number  is  1024),  the  simplest  interconnect  network  would  be  some 
form  of  ring.  Each  ACP,  and  therefore  CEO,  would  have  a  unique  address  on  the  ring 
and  all  messages  would  pass  through  each  ACP  in  the  manner  of  a  token  ring.  In  this 
sense  the  system  is  similar  to  some  data  flow  architectures4  but  performs  a  very  dif¬ 
ferent  function  in  that  its  major  objective  is  decision  making  rather  than  numerical 
computation.  As  currendy  envisaged,  the  ring  of  ACPs  will  have  a  host  interface  by 
means  of  which  messages  can  be  injected  into  the  system  to  cause  it  to  start  making 
decisions  and  by  means  of  which  decision  results  can  be  extracted.  It  is  also  envisaged 
that  messages  will  be  used  to  load  CEO  code  into  each  ACP  and  to  perform  debugging 
and  monitoring  functions. 

Summary 

The  implementation  of  this  activation  framework  architecture  for  real-time  AI  is  a 
major  thrust  of  our  current  research.  Devising  a  software  methodology  for  program¬ 
ming  CEO  based  expert  systems  and  finding  algorithms  to  efficiendy  handle  the  large 
volume  of  messages  in  this  type  of  system  is  an  important  part  of  our  research.  We 
are  also  investigating  an  architecture  for  an  ACP  which  uses  specialized  VLSI  chips  in 
conjunction  with  standard  VLSI  processor  components  to  provide  an  efficient  environ¬ 
ment  for  supporting  the  activation  framework.  We  are  also  pursuing  the  design  and 
simulation  of  a  simple  bit-serial  VLSI  ACP  for  use  as  a  test  vehicle  for  the  activation 
framework  concept 
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There  is  no  "problem  area"  more  important  than  assurance  of  reliable  computer 
systems.  Unfortunately,  reliability  is  becoming  an  even  more  crucial  problem  because  of 
the  criticality  of  many  computer  applications.  One  of  the  reasons  for  the  escalating 
interest  in  reliability  is  that  some  of  the  most  critical  applications  will  be  executed  in 
distributed  computer  environments  and  distribution  introduces  another  level  of 
complexity.  The  preliminary  results  of  a  recent  research  study  indicated  that  failure 
prediction  might  be  a  viable  approach  to  increasing  computer  reliability.  It  also 
appears  that  prediction  could  be  especially  useful  in  increasing  the  reliability  of 
distributed  systems. 

The  main  objective  of  a  failure  prediction  system  is  to  prevent  catastrophic  failures: 
however,  a  secondary  objective  is  to  provide  early  (and  automated)  diagnosis  of  possible 
component  failure.  Early  diagnosis  of  potential  problems  allows  recovery  procedures  to 
be  initiated  sooner  and  thereby  improve  the  possibility  of  preventing  failures.  The 
sooner  reconfiguration  is  enabled  (through  early  warning  of  potential  failure)  for 
systems  having  distributed  architectures,  the  less  chance  there  is  that  data  corruption 
and  other  related  distributed  system  problems  will  occur. 

The  failure  prediction  research  was  part  of  a  study  to  measure  and  model  computer 
reliability  as  affected  by  system  activity  Iyer  82,85j,  Mourad  85  ,  Velardi  84\  Data 
was  collected  from  several  generations  of  mainframe  computers.  During  analysis  of  the 
computer  failure  and  activity  data,  Velardi  observed  that  immediately  before  a  crash 
there  was  an  increase  in  the  number  of  errors  recorded  by  the  operating  system. 
Subsequently,  part  of  the  research  effort  was  directed  toward  determining  the 
characteristics  of  the  change  in  error  generation  prior  to  failure  with  the  hope  of  being 
able  to  provide  a  statistical  basis  for  a  failure  prediction  strategy.  The  first  step  was  to 
characterize  and  cluster  "crashes"  to  find  appropriate  intervals  of  time  for  analysis  of 
data.  The  next  steps  involved  averaging  and  weighting  error  distribution  data, 
analyzing  individual  intervals  between  crashes,  and  analyzing  CPU  utilization  rates 
prior  to  crashes.  A  detailed  methodology  for  analysis  of  failure  prediction  data  was 
developed  and  reported  in  Nas-ar  *5  . 

>ince  the  results  of  a  rigorous  data  analysis  confirmed  the  existence  of  certain  patterns 
in  error  distribution  before  a  era'll,  a  prototype  of  an  actual  failure  prediction  software 
'Vstem  was  implemented.  The  algorithm  was  based  on  the  detection  of  large  error 
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clusters,  that  is,  a  threshold  number  of  errors  which  might  be  precursors  of  a  crash. 
(Threshold  numbers  were  obtained  for  each  type  of  error  from  historical  data.)  When 
this  elementary  strategy  was  tested  by  simulation  using  available  data  as  input,  the 
results  indicated  that  failure  prediction  may  indeed  be  feasible.  Other  parameters  (such 
as  the  gradual  increase  in  error  generation  rate  and/or  system  utilization)  could  be 
included  in  the  algorithm  to  further  refine  and  improve  the  accuracy  of  the  strategy. 
Additional  experimentation  and  study,  however,  is  necessary  to  provide  the  ultimate 
assessment  of  failure  prediction  feasibility. 

The  failure  prediction  strategy  proposed  as  a  result  of  this  research  is  based  on  actual 
error  statistics  (and  probabilities)  and  could  be  implemented  using  artificial  intelligence 
technologies.  A  particularly  useful  feature  would  be  to  build  an  adaptive  or  learning 
algorithm  into  the  implementation  of  a  failure  prediction  strategy  in  order  to  improve 
diagnostic  capabilities.  If  this  were  done,  then  changes  in  architectures  of  the  host 
computer  would  not  require  a  completely  new  implementation  of  a  failure  prediction 
strategy. 

This  work  was  supported  in  part  by  the  L’.S.  Army  Research  Office  under  contract 
number  D.AAG29-82-K-I05.  The  views,  opinions,  and/or  findings  contained  in  this 
document  are  those  of  the  author  and  should  not  be  construed  as  an  official  Department 
of  the  Army  position,  policy,  or  decision,  unless  so  designated  by  other  official 
documentation. 
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Programming  Language  Translation  for  Multicomputers 
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Everyone  seems  to  agree  that  parallel  computing  is  an  essential  part  of  the 
future,  and  that  programming  is  a  major  challenge  in  effective  use  of  parallel  sys¬ 
tems.  The  problem  for  language  and  compiler  researchers  in  the  coming  years  is  to 
provide  languages  convenient  to  programming  and  compilers  to  translate  them  to 
efficient  parallel  code. 

We  can  divide  the  problem  of  compiling  for  parallel  target  machines  into  two 
phases:  parallelism  detection  and  parallelism  exploitation.  Detection  is  the  process 
of  discovering  which  activities  may  be  performed  in  parallel;  exploitation  is  the 
process  of  determining  which  activities  will  in  fact  be  performed  in  parallel  on  the 
limited  resources  of  the  machine.  Parallelism  detection  can  be  performed  without 
regard  to  the  characteristics  of  the  target  machine;  parallelism  exploitation  is 
dependent  on  the  machine,  but  is  independent  of  the  language  in  which  the  pro¬ 
gram  was  written. 

Although  detection  and  exploitation  can  be  separated,  they  are  typically  con¬ 
sidered  together,  and  most  papers  present  a  technique  for  a  particular  language 
and  target  machine.  Vectorizing  compilers,  for  expample,  only  search  for  the 
kinds  of  parallelism  that  a  vector  machine  can  exploit.  The  trace  scheduling  tech¬ 
nique  is  closely  associate  with  the  Bulldog  compiler  and  the  VLIW  machine[‘2). 

Although  there  is  nothing  wrong  with  combining  the  two  phases  in  a  particu¬ 
lar  compiler,  there  is  something  to  be  gained  by  separating  them.  Just  as  tradi¬ 
tional  optimization  techniques  for  sequential  machines  are  divided  into  machine- 
independent  and  machine-dependent  parts,  so  should  optimizations  for  parallel 
machines  be  divided  into  machine-independent  parallelism  detection,  and 
machine-dependent  parallelism  exploitation.  Future  work  on  parallel  compilers 
will  benefit  by  maintaining  the  distinction. 

When  one  discusses  parallel  compilers,  one  of  the  first  questions  asked,  unfor¬ 
tunately,  is  "what  language  will  you  compile!*"  The  choice  of  language  is,  of  course, 
very  important.  Features  of  the  language  may  help  or  hinder  the  compiler  in  the 
discovery  of  parallelism.  More  importantly,  a  good  language  for  parallel  computa¬ 
tion  will  encourage  the  programmer  to  write  code  so  that  parallelism  inherent  in 
the  algorithm  is  not  obscured,  and  perhaps  to  think  of  algorithms  that  have  more 
inherent  parallelism.  But  the  choice  of  language  really  only  affects  parallelism 
detection,  and  the  emphasis  on  languages  tends  to  draw  attention  away  from  the 
equally  important  problem  of  parallelism  exploitation. 

A  parallel  loop  may  be  written  as  a  sequential  for-loop.  with  the  t  irallelism 
being  detected  by  careful  analysis  of  data  flow,  or  it  may  be  written  as  an  explicitly 


parallel  forall  loop,  or  it  may  be  written  as  a  single  operation  to  an  entire  array,  as 
would  be  typical  in  APL.  Having  discovered  that  the  loop  is  potentially  parallel, 
the  compiler  must  still  determine  whether  the  parallelism  can  be  used  by  the  target 
machine,  and  whether  using  the  parallelism  is  likely  improve  the  performance  of 
the  program.  If  parallel  computation  increases  contention  for  resources  or  over¬ 
head  for  synchronization,  then  the  speedup  due  to  parallelism  may  be  lost.  On 
some  machines,  careful  selection  of  parallel  operations  and  assignment  to  proces¬ 
sors  and  memory  is  as  important  as  detecting  the  parallelism  in  the  first  place. 

Since  parallel  compilation  techniques  are  usually  presented  in  the  context  of  a 
particular  model  of  parallel  computation,  their  potential  applicability  to  other 
models  of  computation  is  often  given  little  notice.  Loop  analysis  techniques  are 
most  often  applied  to  vector  and  array  processors,  but  the  parallelism  detection 
they  provide  may  be  used  on  a  variety  of  machines.  Trace  scheduling  is  associated 
with  the  VLIW  machine  [2],  but  may  be  useful  on  other  multiprocessor  systems  [1]. 
By  paying  attention  to  what  the  machine  dependencies  really  are  in  various  compi¬ 
lation  techniques,  researchers  will  be  better  able  to  share  and  adapt  good  stra¬ 
tegies.  The  sharing  of  similar  strategies  for  parallelism  detection  and  exploitation 
will  also  facilitate  comparison  of  various  models  of  parallel  execution. 

Carefully  maintaining  the  distinction  between  the  machine  independent  and 
dependent  parts  of  a  parallel  compiler,  then,  will  aid  future  work  in  several  ways. 
First,  it  will  help  to  emphasize  that  the  goal  of  a  new  language  for  parallel  comput¬ 
ing  is  not  to  make  the  programmer  code  in  more  parallelism,  but  to  keep  him  from 
coding  in  sequentiality.  Ideally,  the  programmer  should  be  removed  from  low  level 
details  of  parallelism.  Second,  it  will  help  to  show  the  similarities  and  differences 
in  various  compiling  techniques.  And  third,  it  will  help  to  make  software  support 
more  uniform  across  a  variety  of  parallel  architectures,  simplifying  the  problem  of 
comparing  their  merits. 
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1.  Availability  and  Replication 

Distributed  systems  have  the  potential  to 
achieve  better  reliability  and  performance  because 
of  their  many  resources.  Atomic  actions  [7]  have 
been  proposed  to  guarantee  correct  system 
function  regardless  of  site  crashes  and  conflicts 
between  concurrent  operations.  By  ensuring  the 
atomicity  of  a  Remote  Procedure  Call  (RPC)  (6), 
atomic  RPC  [5]  can  provide  a  transparent  and 
reliable  primitive  which  hides  distributed 
programming  complications  from  naive  users. 

One  desired  property  which  is  not  provided 
by  atomic  RPC  is  availability:  the  system  should 
be  available  with  high  probability.  Replication  [4] 
has  been  proposed  as  a  means  to  increase 
availability;  either  data  or  operations  are 
replicated  on  independent  sites  so  that  some  sites' 
failures  still  leave  the  system  with  enough 
resources  to  continue  execution.  Replicated  data 
are  stored  on  different  sites  and  are  usually 
identified  by  version  numbers.  Algorithms  exist  to 
guarantee  consistency  between  copies  [2,3].  Due  to 
the  recent  dramatic  decrease  in  the  cost  of 
computer  hardware,  it  is  now  possible  and  will  be 
common  in  the  future  to  have  many  processors 
devoted  to  a  single  application.  We  are 
implementing  a  new  construct,  Resilient  Procedure 
(RP),  which  allows  a  procedure  to  be  hierarchically 
replicated  at  several  sites  so  that  the  service 
provided  by  the  procedure  will  be  available  despite 
site  failures.  The  RP  construct  has  several 
desirable  properties: 

(1)  Fault-Tolerance:  By  replicating  a 

prodedure  at  k  sites,  the  resulting  RP  can 
survive  up  to  k-1  site  failures. 

(2)  Transparency:  From  a  caller’s  point  of 
view,  a  RP  is  the  same  as  a  normal 
procedure  providing  the  same  service.  Thus 
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a  normal  procedure  can  be  replaced  by  a  RP 
whenever  the  resilience  or  the  concurrency  is 
desired. 

(3)  Efficiency:  RP’s  may  yield  performance 
gains,  as  some  sites  may  be  less  loaded, 
providing  shorter  response  time  than  if  the 
procedure  were  performed  at  a  single  site. 

In  Section  2,  we  briefly  describe  the  RP 
construct.  We  discuss  some  fault  tolerance  issues 
in  Section  3. 

2.  Resilient  Procedure 

We  assume  first  that  all  sites  are 
homogeneous.  Heterogeneous  environment 
problems  will  be  discussed  in  section  3.  Networks 
are  assumed  to  be  asynchronous  but  reliable,  i.e. 
messages  are  always  delivered  correctly  and  in- 
order. 

A  RP  is  a  special  procedure  construct  which 
consists  of  a  coordinator  and  several  cohorts 
(Figure  1)  residing  on  multiple  sites.  The 
coordinator  is  the  only  entity  of  the  RP  visible 
from  outside.  All  incoming  call/return  messages 
are  directed  to  the  coordinator  and  all  outgoing 
call/return  messages  are  sent  from  the  coordinator. 


Figure  1  RP  called  by  A  and  calls  B 


When  an  incoming  call  message  is  received 
by  the  coordinator,  it  determines  if  the  call  can  be 
accepted,  just  like  a  normal  atomic  procedure.  If 
the  call  is  accepted,  it  is  then  transferred  to  the 
cohorts.  Upon  receiving  the  call  message  from  the 
coordinator,  each  cohort  starts  its  execution 
independently.  No  global  coordination  is  required. 

When  a  cohort  wants  to  call  an  external 
procedure,  instead  of  sending  the  call  message  to 
the  callee  directly,  the  call  is  sent  via  the 
coordinator.  The  coordinator  collects  all  cohort 
call  messages  destined  for  the  callee  and  sends  only 
one  call.  The  fact  that  there  are  multiple  cohorts 
is  transparent  to  the  external  procedure  which 
performs  a  single  call  as  if  it  were  from  a  non- 
resilient  procedure. 

The  coordinator  does  not  have  to  wait  until 
all  of  its  cohorts  have  requested  the  call  to  the 
same  callee.  Although  some  cohorts  may  be  slower 
than  others,  all  will  eventually  make  the  call 
(unless  they  crash  -  this  is  handled  separately).  If 
a  fast  cohort  has  made  the  call,  and  the  result  has 
already  been  returned,  then  the  coordinator  simply 
returns  that  result  to  the  slower  cohort 
immediately.  This  approach,  besides  speeding  up 
the  overall  performance,  synchronizes  all  cohorts 
by  allowing  a  slow  cohort  immediate  access  to 
external  results. 

When  an  external  procedure  call  result  is 
received  by  the  coordinator,  it  checks  to  see  which 
cohorts  are  waiting  for  the  result.  The  result  is 
distributed  to  those  cohorts  which  then  resume 
execution.  Upon  completion,  each  cohort  returns 
its  result  to  the  coordinator.  The  coordinator 
collects  results  from  all  cohorts  and  returns  a 
single  result  to  the  caller. 

8.  Fault-Tolerance  Issues 

8.1.  Atomicity  Considerations 

Concurrency  control  is  required  to 
implement  atomic  procedure  calls.  One  procedure 
may  receive  call  messages  from  several  concurrent 
caller  procedures;  only  one  of  the  calls  should  be 
accepted  to  avoid  inconsistent  read/ write 
interleaving  on  local  data.  Furthermore,  if  a 
nested  atomic  call  model  is  used,  a  non -recursive 
new  call  must  be  delayed  until  the  previous  atomic 
call  has  committed.  Since  the  coordinator  is  the 
only  entity  visible  from  other  procedures  ,  it  must 
perform  concurrency  control  for  the  RP. 


Two  types  of  failures  can  occur  during  a 
procedure’s  execution:  expected  or  unexpected 
failures.  An  expected  failure  can  be  handled  by 
some  kind  of  exception  handling  mechanisms.  An 
unexpected  failure  is  usually  handled  by  a 
backward  recovery  mechanism.  In  the  RP 
environment  they  can  be  treated  as  site  crashes, 
merely  reducing  the  number  of  cohorts  in  the  RP. 
However,  If  we  can  guarantee  that  each  RP  always 
has  at  least  one  active  cohort  available  and  if  all 
procedures  in  the  system  are  implemented  with  the 
RP  construct,  then  no  backward  recovery  is 
needed. 

8.2.  The  Coordinator  Resiliency 

The  coordinator  is  the  critical  component  of 
the  RP;  its  failure  renders  the  cohorts  unreachable. 
Therefore,  if  the  coordinator  crashes,  one  of  the 
cohorts  must  be  promoted  to  become  the  new 
coordinator  for  the  RP.  In  order  to  allow  a  cohort 
to  become  a  coordinator,  any  system  information 
maintained  by  the  coordinator  must  be  replicated 
in  each  of  the  cohorts.  Our  approach  is  to 
piggyback  updates  to  the  system  information  table 
onto  the  call  request  and  return  messages 
exchanged  between  the  coordinator  and  its  cohorts. 
Since  the  system  information  is  updated  only  when 
there  are  new  call/return  activities,  there  is  always 
communication  between  the  coordinator  and 
cohorts  immediately  following  a  table  update.  So 
the  coordinator  does  not  have  to  wait  a  long  time 
before  the  update  is  propagated. 

When  the  coordinator  crashes,  the  cohort 
which  communicated  with  the  coordinator  most 
recently  has  the  current  system  table,  and  will 
become  the  new  coordinator.  Each  cohort  keeps  a 
timestamp  sent  by  the  coordinator  during  the  last 
successful  conversation.  By  comparing  timestamps 
in  a  voting  protocol,  the  cohort  with  the  latest 
timestamp  is  promoted. 

8.3.  Heterogeneous  Environment 

Cohorts  running  in  a  heterogeneous 
environment  can  return  different  results  with 
different  response  times.  Depending  on  the 
application,  many  protocols  can  be  used  to  choose 
one  result  from  the  returned  values.  Some  of  the 
possible  protocols  are:  majority,  first,  or  primary. 
Majority  protocol  takes  the  result  of  the  majority 
of  the  cohorts  and  is  used  to  deliver  more  reliable 
results  in  a  noisy  environment.  First  protocol' 
returns  the  first  result  received  by  the  coordinator 
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and  is  used  to  improve  performance.  Primary 
result  protocol  is  used  if  one  cohort  is  known  to  be 
the  most  accurate  or  reliable. 

A  heterogeneous  environment  can  be  used  to 
detect  faults  at  cohort  sites.  By  replicating  a 
procedure  onto  multiple  sites,  the  RP  becomes 
tolerant  of  transient  physical  faults.  If  the  sites 
are  heterogeneous,  the  RP  then  provides  hardware 
design  tolerance.  Software  design  tolerance  for  the 
system  software  and  execution  environments  at  the 
distributed  sites  is  also  provided.  Furthermore,  if 
different  software  versions  [1]  are  executed  at 
different  cohorts,  software  fault-tolerance  can  also 
be  achieved. 

The  RP  construct  allows  the  user  a  great 
degree  of  freedom  in  choosing  the  method  used  by 
the  coordinator  to  resolve  inconsistent  cohort 
execution  results.  The  advantages  of  a  consensus 
decision  are  imparted  by  replication,  and 
application  specific  criteria  can  also  be 
incorporated  into  the  decision  algorithm. 

4.  Conclusion 

In  this  paper,  we  presented  a  new  system 
construct,  Reslient  Procedures,  which  has  the 
desirable  properties  of  fault-tolerance  and 
availability.  We  discussed  the  atomicity  and 
heterogeneous  issues  that  can  occur  when 
replicating  a  procedure.  We  showed  further  that 
the  structure  can  be  used  for  many  applications  in 
distributed  and  parallel  computations. 

By  allowing  the  user  to  specify  the  degree  of 
replication,  the  diversity  of  replications  (e.g. 
heterogeneous  sites,  software  versions,  etc.),  and 
the  method  of  coordination,  the  RP  construct 
provides  the  user  a  friendly  and  powerful 
programming  environment. 
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Introduction 

Increased  computer  power  and  performance  in  recent  years  made  possible  advanced 
applications  in  such  fields  as  avionics  systems,  aerodynamic  simulations,  industrial  automa¬ 
tion,  military  defense,  weather  forecasting,  and  more.  Most  of  these  advances  are  based  on 
the  concept  of  parallel  processsing.  Parallelism  can  be  applied  at  various  levels  of  computa¬ 
tion,  i.e.,  the  algorithm  level,  program  level,  system  level,  machine  level,  and  the  microlevel. 
Research  effort  on  parallel  processing  currently  underway  has  been  directed  mostly  at  the 
higher  end  of  the  computationectrum.  For  example,  work  on  multiprocessing,  pipeline 
machines  and  distributed  computers  is  well  documented  in  the  literature.  Little  has  been 
done  at  the  lower  end,  with  the  exception  perhaps  of  systolic  array  structures. 

The  aim  of  this  work  is  to  propose  an  approach  to  low  level  parallel  processing.  In  our 
view,  parallelism  at  the  microlevel  refers  to  a  multiprocessing  environment  controlled  by 
cooperative  microengines  executing  concurrent  microcode.  This  environment  provides  the 
capability  for  parallel  implementation  of  functions  via  concurrent  microprogramming.  This 
would  essentially  constitute  parallel  firmware  migration.  The  reasons  for  parallel  migration 
are  quite  similar  to  the  ones  of  function  migration  in  uniprocessing  structures;  namely,  sub¬ 
stantial  gains  in  speed,  reliability  and  stability.  Moreover,  due  to  faster  time  response, 
microlevel  parallelism  would  be  better  equiped  to  handle  real-time  applications,  e  g.,  mul¬ 
tisensor  functioning  of  complex  automation  systems.  In  what  follows,  first  we  present  the 
essential  ideas  of  a  new  technique  for  function  migration  in  a  uniprocessor  controlled 
environment.  Then,  we  describe  the  structure  of  a  cooperative  microcontrol  system 
together  with  extended  sequencing  protocols  to  support  concurrency  at  the  microlevel 

Migration  Technique 

Recall  that  vertical  migration  of  frequently  used  software  into  firmware,  labeled 
firmware  migration,  is  a  well  known  technique  for  improving  the  system  performance.  It 
has  been  practiced  by  the  designers  of  such  systems  as  the  IBM  370  and  the  DEC  PDF  1 1 
series.  However,  firmware  migration  has  been  influenced  by  VLSI  technology  due  to  the 
capability  to  embed  in  silicon  not  just  ’’traditional”  microprograms,  i.e.  instruction  set 
interpretations,  but  also  complicated  software  functions  such  as  data  tables,  parsers  or 
operating  system  primitives.  In  addition  to  system  software,  complex  algorithms  appear  in 
other  areas  that  may  have  military  applications  such  as  communication  algorithms  and 
radar  tracking.  In  general,  such  functions  have  complex  logical,  i.e.  sequencing,  structure 
They  are  represented  by  complex  flow  charts  with  many  multiway  branchings,  nested  loop- 
ings  and  the  like.  Thus,  cost-effective  migration  of  such  functions  requires  modular 
microprogram  structures  with  powerful  sequencing  capability. 


Our  major  objective  of  our  recent  work  has  been  to  explore  a  new  methodology  for 
automated  software- to-firmware  migration  based  on  a  new  microcontrol  architecture  we 
have  developed  recently.  This  scheme  is  endowed  with  complex  sequencing  constructs 
available  at  the  microlevel.  We  feel  that  our  approach  has  the  advantage  of  compatibility 
with  VLSI  in  that  it  effects  software  migration  into  VLSI  microcode  via  firmware  tech¬ 
niques.  Such  an  automated  migrator  could  be  an  important  ingredient  of  a  hierachial  sili¬ 
con  compiler  if  it  is  interfaced  to  appropriate  CAD  layout  tools. 

The  basic  idea  of  the  proposed  method  is  to  capture  the  sequencing  structure  of  the 
candidate  function  via  effective  compilation  techniques.  This  idea  can  be  described  by  the 
following  abstract  migration  model  which,  incidently,  can  be  employed  at  several  computa¬ 
tion  levels  but  it  is  very  effective  at  the  microlevel.  Every  function  F  can  be  represented  by 
the  following  two  data  structures: 

1.  A  sequencing  graph,  dependent  on  F. 

2.  A  library  of  operation  modules,  which  is  independent  of  F. 

The  sequencing  concept  encapsulates  essentially  the  sequencing  structure  of  F.  It  consists 
of  a  listing  of  pure  sequencing  constructs  including  sequence  calls  to  the  library.  The 
modules  in  the  library  consist  of  series  of  command-types  performing  related  operations 
comprising  a  meaningful  subfunction,  for  example  ADD  module.  Specifically,  the  modules 
comprise  the  structured  firmware  implementation  of  an  instruction  set  of  a  base  machine  on 
which  the  migrating  function  is  to  be  tested.  If  this  firmware  code  is  not  already  available, 
it  may  be  produced  by  a  microcode  generator  tool  and  executed  on  the  base  machine. 

The  important  point  here  is  that  the  library  is  a  universal  structure,  i.e..  it  should  be 
general  enough  to  cover  a  collection  of  functions  but,  once  it  is  written,  it  can  be  employed 
with  every  function  in  the  collection.  To  perform  migration,  in  this  model,  we  only  need  to 
generate  the  sequencing  structure  of  each  F  and  then  link  it  to  the  library.  The  basic  steps 
to  do  this  are:  a)  function  high-level  compilation,  b)  control  flow  graph  construction,  and  c) 
sequencing  constructs  detection  in  the  function  graph.  The  end  result  is  the  generation  of 
microsequencing  code  for  the  function,  consisting  of  the  constructs  mentioned  earlier,  par¬ 
ticularly  sequence  calls  to  the  library  of  microcode  modules. 

An  important  requirement  for  the  proposed  method  to  be  practical  is  the  retargetabil- 
ity  of  the  migration  process.  This  term  means  the  adaptability  of  firmware  migration  to  a 
new  base  machine  obtained  by  architecture  redefinition.  The  latter  is  usually  accomplished 
by  means  of  a  hardware  design  language  at  the  register-transfer  level.  In  our  view  the  fol¬ 
lowing  three  problems  are  involved: 

(a)  Redefinition  of  the  base  machine  structure. 

(b)  Retargetability  of  the  migrator  sequencing  code. 

(c)  Library  construction  for  the  machine  defined. 

Our  approach  involves  the  use  of  a  hardware  design  language  (HDL)  such  as  ISP'-N.mPc 
developed  by  other  colleagues  at  Case.  The  basic  property  of  these  languages  is  their  capa¬ 
bility  to  define  a  microarchitecture  by  its  structure  and  behavior.  We  plan  to  expand  this 
capability  by  defining  the  control  scheme  to  be  used  Our  basic  idea  to  achieve  retargeta- 
bility  is  to  capture  the  assembly  code  version  of  the  migrating  function  and  then  proceed 
along  the  steps  described  earlier  However,  we  propose  to  use  as  assembly  code  the 
behavioral  statements  of  the  HDL.  Further,  we  plan  to  use  a  preprocessor  which  will 
translate  the  high-level  language  description  of  the  migrating  function  into  the  HDL 
behavioral  code.  Machine  dependencies  will  be  introduced  at  that  level  by  binding  together 


the  structural  and  behavioral  descriptions  of  the  machine.  A  major  advantage  of  this  tech¬ 
nique  is  that  it  does  not  require  changes  in  the  migration  process  when  the  base  machine  is 
redefined.  Thus,  the  problem  of  migration  retargetability  is  handled  by  the  means  used  for 
architectureecification. 

Parallelism  at  the  Micro  level 

In  general,  parallel  processing  at  the  higher-end  refers  to  the  concurrent  operation  of 
several  processors  which  share  a  number  of  resources.  Clearly,  there  should  be  a  control 
mechanism  or  protocol  to  regulate  the  accessing  and  releasing  of  resources  to  avoid  coflicts. 
Such  regulation  mechanisms  based  on  mutual  exclusion  principles  are  well  documented  in 
the  literature.  These  general  principles  also  apply  to  parallelism  at  the  microlevel.  In  our 
approach,  we  consider  a  cooperative  microcontrol  environment  consisting  of  several  micro¬ 
controllers  each  executing  a  migrating  function.  The  proposed  scheme  is  briefly  discussed 
below. 

Consider  m  microcontrollers  Cl,...,Cm,  that  implement  concurrently  the  migration 
functions  Fl,...,Fm,  rectively.  More  precisely,  each  Ci  contains  the  sequencing  structure  of 
Fi  in  its  sequencing  store  Si.  Moreover,  there  exists  a  library  Li  of  microcode  modules 
embedded  in  the  microcode  store  Mi  of  Ci,  i=l,...,m.  There  are  also  n  processing  regions 
Pl,...,Pn  which  can  be  requested  by  the  microcontrollers  such  that  Pi  can  be  requested  by 
only  one  microcontroller  Cj  at  the  time.  We  make  no  assumption  as  to  the  nature  of  pro¬ 
cessing  regions  Pi,  in  other  words,  they  may  be  processors,  peripherals,  memories  or  simply 
distributed  hardware.  The  requests  of  the  processing  regions  are  made  by  explicit  reference 
of  the  regions  in  the  code  structure  of  the  library  microcode  modules.  For  example,  an 
ADD  type  of  module  in  the  second  library  L2  may  refer  to  an  adder  located  in  the  third 
processing  region  P3.  These  references  of  microcode  modules  to  regions  may  be  static  or 
dynamic.  The  preceding  adder  example  is  a  static  reference  whereas  an  indirect  reference 
to  a  main  memory  location  is  dynamic. 

The  main  problem  of  the  cooperative  microcontrol  scheme  is  a  resource  management 
problem  at  the  microlevel:  the  effective  regulation  of  the  competetive  requests  of  processing 
regions  coming  from  microcode  modules.  Although  solutions  to  this  problem  at  the  high  end 
are  well  documented  in  the  literature,  the  problem  has  not  even  been  formulated  in  the 
microlevel  context.  The  advantage  of  our  approach  is  in  utilizing  a  microcontrol  scheme 
with  poweful  sequencing  capability  supported  by  developed  tools.  We  feel  that  for  coopera¬ 
tive  microcontrol  we  need  to  extend  the  proposed  sequencing  constructs  to  include  con¬ 
structs  that  provide  synchronization  at  the  microlevel.  The  coordination  protocol  is  con¬ 
ceptually  described  below,  assuming  that  each  microcode  module  requests  a  single  process¬ 
ing  region  at  each  time. 

When  the  microcode  module  Dip,  located  in  library  Li,  is  invoked  by  a  sequence  call, 
then  the  controller  Ci  should  issue  a  signal  indicating  the  attempt  to  ’’enter”  (request)  a 
processing  region  Pk.  Again,  Pk  is  referenced  in  the  code  structure  of  Dip.  If  Pk  is  not 
currently  "occupied”  i.e.  not  requested  by  another  module  Djq,  then  Dip  can  safely  enter 
Pk.  However,  if  Pk  is  currently  occupied,  then  Ci  should  enter  a  ”wait-until”  loop  till  Pk  is 
"released.”  During  this  waiting  time  the  microcode  store  should  be  in  a  NO-OP  state, 
perhaps  via  a  No-Op  module.  Additional  sequencing  constructs  required  are  "enter  region" 
"exit  region",  "return  without  exit”  and  ’’return  with  exit”.  These  should  be  included  in 
the  repertoire  of  constructs  for  microlevel  implementation  of  parallelism  in  the  context  of 
our  scheme. 
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ABSTRACT 

This  paper  describes  some  preliminary  results  obtained  by  the 
authors  while  attempting  to  design  a  systolic  signal  processing 
element  to  be  Implemented  using  the  Tektron Ix-TRIQUINT  GaAs  E/D 
MESFET  Foundry. 


INTRODUCTION 

The  design  constraints  Implied  by  the  literature  published  by 
TRIQUINT  for  their  E/D  MESFET  gates  are: 

1  )  Fan  1  n  <.  3 

2)  Fanout  <  4 

3)  Number  of  gates  per  testable  cell  <  80  (for  83%  yield) 

The  goal  Is  to  design  a  systolic  cell  similar  to  that  Implemented 
by  the  Saxby  Computer  Corporation  but  using  GaAs.  Because  of  the 
yield  problems  of  GaAs,  the  architecture  had  to  be  decomposed 
Into  small  "bit-slice"  cells  which  could  be  fabricated,  tested 
and  Interconnected  to  form  the  systolic  element.  The 
Interconnection  process  employs  fine  geometry  (5-10  pm)  thick 
metal  lines  embedded  In  a  polylmide  dielectric.  The  logic  cells 
can  be  "In"  the  substrate  or  "on"  the  substrate. 

Given  the  space  constraints  of  this  article,  here  we  consider 
only  the  Implementation  of  the  systolic  cell  function  Y=X+A*B  In 
16-bit  arithmetic.  Since  the  TRIQUINT  process  Involves  1  ym 
feature  sizes,  their  lithography  employs  step  and  repeat  reticles 
(up  to  4  may  be  specified).  Figure  1  shows  the  bit  slice  cells 
defined  as  module  subdivisions  of  these  reticles.  The  four 
modules  M-j  -  M^  are  various  column  counters  for  the  multiplier 
and  the  S  cells  are  all  Identical  four-bit  carry  select  adders. 

If  these  cells  were  fabricated  and  mounted  In  separate 
packages,  the  packages  would  adversely  affect  the  performance  of 
the  system.  By  directly  connecting  the  dies  on  the  wafer  (wafer 
scale  Integration  or  wafer  scale  hybrid)  an  extremely  dense 
system  Is  feasible. 

Figure  2  shows  the  overall  architecture  of  this  systolic 
cell.  Figure  3  shows  a  simulated  four-inch  wafer  with  several 


hundred  (at  70*  yield)  of  the  desired  cells  In  the  rough 
proportions  demanded  by  the  architecture.  Figure  4  shows  the 
routed  wafer.  Evidently*  there  will  be  plenty  of  room  to  scale 
up  the  processor  to  a  full  32-bit  processor  possibly  with 
floating  point  features. 

A  histogram  of  the  typical  wiring  lengths  required  for  the 
16-bit  Integer  case  Is  shown  In  Figure  5.  The  estimated  maximum 
delay  for  this  version  of  the  systolic  cell  Is  7.92  ns  plus  I/O 
pad  delays.  Improved  versions  are  definitely  possible* 
especially  If  the  TRIQUINT  yields  are  even  only  slightly  better. 

CONCLUSION 

Wafer  Integration  of  many  small  GaAs  E/D  MESFET  cells  offers 
an  attractive  alternative  to  VLSI  when  the  yield  Is  low  for  the 
1 atter . 
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Figure  1.  Step-and-Repeat  Reticle 
Module  Assignments 
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Figure  2.  Overall  Systolic  Cell  Architecture 
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Figure  3.  Four-inch  diameter  wafer  with 
dies  at  70%  yield 


Figure  4.  Routed  wafer  for  systolic  architecture 


Figure  5.  Histogram  of  wire  lengths  used  in  architecture 
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DESIGN  FOR  A  TPL  COMPILER  SYSTEM  —  A  SYSTEM  FOR 
RETARGETING  HIGH  LEVEL  LANGUAGE  PROGRAMS 

S.  Leong,  0.  Jiang,  S.  Jodis,  and  P.A.D.  de  Maine 

Computer  Science  and  Engineering  Department 
Auburn  University 
Auburn,  Alabama  36849 


I.  INTRODUCTION: 

A  key  problem  is  the  retargeting  of  software  coded  in  high 
level  languages  so  that  they  will  efficiently  execute  in  new 
computer  environments  that  may  have  totally  different  computer 
architectures  and/or  supporting  software.  A  standard  high  level 
language  like  ADA  [1 J  that  is  fully  transportable  cannot,  by 
itself,  solve  the  fundamental  problem  of  transporting  existing 
programs  to  new  environments.  Mimicking,  either  by  hardware  or 
software,  the  source  environment  is  not  a  viable  solution  because 
such  methods  [2J:  Must  be  separately  implemented  for  every  diffe¬ 
rent  object  environment;  and  they  do  not  fully  exploit  the  charac¬ 
teristics  of  the  new  environments. 

In  the  Transportable  Programming  Language,  TPL,  Method  [3j  the 
essential  concept  is  a  retargetable  bifunctional  compiler,  called 
the  HLL-TPL  Compiler,  that  uses  a  hypothetical  language,  called 
the  Hypothetical  Parent  High  Level  Language  (HPHLL),  as  the  trans¬ 
porting  or  retargeting  vechile.  The  HLL-TPL  Compiler,  which  can  be 
retargeted  for  any  environment  (or  dialect),  converts  the  local 
dialect  to  the  HPHLL  and  visa-versa.  That  means  the  TPL  System, 
which  consists  of  the  HLL-TPL  Compiler  and  its  supporting  soft¬ 
ware,  can  be  transported  or  retargeted  by  itself. 

The  TPL  system  that  has  evolved  has  three  different  parts: 

1 .  The  Support  Facilities 

2.  The  Hypothetical  Parent  High  Level  Language,  HPHLL 
3*  The  HLL-TPL  Compiler  System 

The  Support  Facilities  are  used  by  the  bifunctional  HLL-TPL  Compi¬ 
ler  to  compensate  for: 

(a)  Differences  in  architectural  features  like  the  byte  or  word 
sizes  and  the  amount  of  available  core-memory. 

(b)  Special  instructions  that  exploit  architectural  features,  like 
the  character  type  operation  in  IBM  FORTRAN. 

The  Hypothetical  Parent  High  Level  Language,  HPHLL,  for  every 
different  high  level  language  is  constructed  from  its  dialects  as 
described  in  [3]-  Essentially  the  HPHLL  is  a  superset  that  con¬ 
tains  all  the  instructions  in  all  the  dialects  of  the  HLL. 

The  HLL-TPL  Compiler  System  that  is  now  being  tested  has  three 
parts  ; 


2.  The  Table  Generator. 

3*  The  HLL-TPL  Compiler  itself. 


An  essential  part  of  the  design  is  the  Conversion  Rules  Descrip¬ 
tion  Language,  CRDL  [4]*  CRDL  is  a  meta-language  that  is  used  to 
describe  the  procedures  for  converting  one  language  into  another. 
The  CRDL  descriptions  of  the  conversion  procedures  are  used  by  the 
Table  Generator  to  produce  the  TTM  and  CM  Tables  that  drive  the 

HLL-TPL  Compiler  in  its  two  modes:  For  converting  the  local  dia¬ 

lect  to  HPHLL  and  the  HPHLL  to  the  local  dialect. 

The  Rule  Modifier  and  Table  Generator  are  normally  used  only 
when  the  HLL-TPL  System  is  installed,  while  the  HLL-TPL  Compiler 
is  used  for  each  program  conversion.  The  roles  of  the  Rule  Modi¬ 
fier,  Table  Generator  and  HLL-TPL  compiler  are  discussed  in  Sec¬ 
tions  II,  III  and  IV. 

II.  RULE  MODIFIER: 

The  Rule  Modifier  is  a  special  purpose  editor  that  accepts 

descriptions  of  differences  between  the  Default  Dialect  (or  envi¬ 
ronment)  and  the  Local  Dialect  (or  environment)  and  then  modifies 
the  files,  coded  in  CRDL  [4],  that  are  used  by  the  Table  Genera¬ 
tor  to  produce  the  tables  that  actually  drive  the  HLL-TPL 

Compiler.  The  functions  of  the  Rule  Modifier  are  as  follows. 

1 •  It  is  a  tool  for  altering  the  CRDL  descriptions  of  the  Default 
Dialect . 

2.  It  is  an  abstract  layer  between  the  person  who  installs  the 
HLL-TPL  System  and  the  detailed  descriptions  of  the  procedures 
for  converting  the  Local  (or  object)  Dialect  to  the  HPHLL  and 
vice-versa. 


The  description  of  the  Default  Dialect  is  an  integral  part  of  the 
HLL-TPL  Compiler  System.  The  Default  Dialect  is  a  composite  con¬ 
structed  from  the  "most  probable"  dialects.  Those  charged  with  new 
installations  of  the  HLL-TPL  System  use  the  Rule  Modifier  only 
once  to  alter  the  descriptions  of  the  Default  Dialect  to  the  Local 
Dialect.  It  should  be  noted  that  this  approach  simplifies  the  task 
of  installing  HLL-TPL  and  that  it  is  especially  useful  for  compa¬ 
ratively  well  standardized  languages  like  FORTRAN  or  COBOL. 


III.  TABLE  GENERATOR: 


The  Table  Generator  part  of  the  HLn-TPL  Compiler  System  can  be 
viewed  as  a  special  compiler  that  translates  CRDL  descriptions  of 
the  conversion  procedures  to  a  form  that  is  used  by  the  HLL-TPL 
Compiler  to  convert  the  local  dialect  to  HPHLL  and  vice-versa.  The 
Conversion  Rule  Description  Language ,  CRDL,  is  a  meta-language 
tnat  acts  as  a  second  abstract  layer  (between  language  experts 
and  tne  actual  implementation  of  the  HL^-TP^  Compiler). 


The  Table  Generator  translates  the  descriptions  in  the  inter¬ 
mediate  files  to  the  two  so  called  Internal  Tables  (TTM  and  CM) 
that  are  used  by  the  HLL-TPL  Compiler  to  covert  the  local  dialect 
to  iiPHLii  and  visa-versa. 


1 . 


Transportability  is  more  easily  achieved  because  the  TTM  and 
CM  tables  have  uniform  structures  and  both  its  size  and  com¬ 
plexity  are  significantly  reduced  by  describing  the  conversion 
procedures  in  the  form  of  easily  accessed  tables. 


An  efficient  implementation  can  be  realized  by  using  some 
the  many  well  known  optimization  techniques  |_5j* 


of 


3- 


Debugging  and  maintenance  are  simplified  because  the  diffi¬ 
culties  associated  with  debugging  parsing  tables  are  avoided 
[5]- 


The  planned  extension  of  the  TPL  method  for  use  to  convert  any 
dialect  of  any  high  level  language  to  a  standard  language  like  ADA 
are  discussed  in  [4j. 
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IV.  HLL-TPL  COMPILER: 

The  HLL-TPL  Compiler  can  be  viewed  as  a  simulated  compiling 
machine  with  a  dispatcher  and  processors  to  execute  its  own 
instructions.  In  the  Transportability  Testing  (TTM)  and  Compila¬ 
tion  (CM)  Modes  the  HLL-TPL  Compiler  executes  a  conversion  proce¬ 
dure  by  fetching  instructions  one  at  a  time  from  the  TTM  and  CM 
table  respectively  The  sequence  of  operations  is  terminated  when 
the  conversion  procedure  is  completed.  This  kind  of  approach  for 
implementing  the  HLL-TPL  Compiler  offers  the  following  advantages: 
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A  VLSI  Implemen  table  Block  Oriented  Data  Driven  Multiprocessor 

B.  Shirazi  A.R.  Hurson 

Southern  Methodist  University  Pennsylvania  State  University 

ABSTRACT 

The  traditional  parallel  processing  approaches  to  closing  the  computational  gap  have  con* 
sistently  failed  since  parallelism  has  been  forced  through  the  extensions  of  the  sequential  von- 
Neumann  model  rather  than  using  an  inherently  parallel  model.  The  data  driven  model  allows  a 
high  degree  of  parallelism  and  asynchrony  among  its  basic  operators  and  thus,  it  has  been  con* 
sidered  as  a  promising  alternative  to  the  von*Neumann  based  architectures.  This  paper  Bret 
discusses  the  trend  in  the  design  of  the  data  driven  machines  through  a  classification  of  these  sys¬ 
tems.  It  then  addresses  the  design  of  a  processing  module  capable  of  executing  a  program  block 
in  data  driven  fashion.  These  processing  modules  are  used  as  the  building  blocks  for  the  construc¬ 
tion  of  a  block  driven  multiprocessor.  The  organization  of  this  system  matches  the  semantics  of 
the  block  and  data  driven  models  by  allowing  efficient  and  faster  intra  block  as  opposed  to  the 
longer  inter  block  communications.  In  addition  to  the  basic  design,  we  will  address  the  VLSI 
complexity  of  the  system  by  estimating  the  geometry  area  of  the  basic  elements  and  the  timing 
delay  of  the  operations  based  on  the  current  technology. 

I.  Introduction  and  Background 

The  challenge  of  closing  the  computational  gap  has  promoted  the  introduction  of  some  alter¬ 
native  architectures  to  the  von  Neumann  machines.  The  data  driven  model  is  one  the  most 
promising  of  such  alternatives.  In  a  data  driven  environment,  an  operation  is  carried  out,  in 
parallel  with  other  operations,  as  soon  as  its  input  operands  are  available.  The  concept  of  asyn¬ 
chrony  embedded  in  the  definition  of  a  data  driven  architecture  provides  grounds  for  a  high 
degree  of  implicit  parallelism.  In  addition,  the  data  driven  organization  eliminates  the  need  for 
an  updatable  storage,  use  of  identifiers  and  all  of  their  associated  by-products  such  as  global  side- 
effects  and  aliasing. 

Nevertheless,  the  parallel  nature  of  the  data  driven  computation  coupled  with  the  freedom 
from  side-effects,  imply  a  strain  on  the  interconnection  network  of  these  systems.  This  leads  to 
the  increased  cost  and  complexity  of  the  network  as  well  as  erecting  a  potential  bottleneck  due  to 
network  delays.  Therefore,  the  large-grained  data  flow  architectures  have  recently  been  investi¬ 
gated  as  an  alternative  for  their  fine-grained  counterparts.  This  has  led  to  the  development  of  the 
block  driven  architectures  which  explore  the  parallelism  at  the  program  block  level  [  1) . 

Basically,  data  driven  architectures  are  decentralized  control  systems  and  hence,  they  could 
be  classified  as  M1MD  machines.  We  propose  a  classification  which  is  based  on  the  interrelation¬ 
ship  and  communication  among  the  data  menory,  the  instruction  memory,  and  the  processing 
elements.  According  to  this  classification,  the  data  driven  architectures  are  grouped  into  three 
classes.  Class  I,  which  is  representative  of  the  dynamic  data  driven  machines,  uses  a  separate 
data  and  instruction  memory  as  well  as  a  set  of  separate  processing  units.  The  Manchester  and 
the  Irvine  machines  are  examples  of  the  architectures  in  this  class.  They  allow  recursion  and 
parallel  execution  of  different  activations  of  a  block.  However,  some  additional  overhead  is  intro¬ 
duced  due  to  the  manipulation  of  the  color  information  and  the  fact  that  only  one  :opy  of  the 
program  block  is  kept.  The  Class  II,  or  static  data  driven  architectures,  represents  the  unification 
of  the  data  and  the  instruction  memories  in  which  each  instruction  is  self  contained.  The  NOT 
and  the  TI  machines  are  some  examples  in  this  group.  Even  though  the  machines  in  this  class 
have  a  simpler  organization  and  do  not  introduce  additional  overhead,  they  do  not  allow  coex¬ 
istence  of  multiple  activations  of  the  same  block.  Finally,  in  Class  III,  as  a  promising  representa¬ 
tive  of  the  future  data  driven  machines,  the  processing  power  is  incorporated  at  the  memory  level. 
In  other  words,  each  memory  cell  contains  an  instruction  and  has  enough  processing  power  to 
carry  out  the  operation,  when  the  data  is  available.  The  system  has  a  cellular  organization  and 


therefore  it  is  suitable  for  VLSI  implementation. 

This  paper  discusses  the  architectural  aspects  of  a  data  driven  organization  which  is  based 
on  the  characteristics  of  the  Class  III,  while  incorporating  multiprocessing  capabilities  at  the  pro¬ 
gram  block  level  (block  driven).  The  VLSI  complexity  of  the  proposed  architecture  is  discussed 
through  an  analysis  of  the  timing  and  geometry  areas  of  the  building  block  elements. 

f.  The  System  Overview 

A  viable  data  driven  architecture  should  comply  with  the  technological  constraints,  offer  a 
better  performance  for  inherently  parallel  problems,  reduce  fine-grained  communication  delays, 
and  introduce  a  practical  and  effective  solution  for  the  manipulation  of  the  data  structures. 
These  criteria  have  led  us  to  the  introduction  of  the  WDDM,  a  Wafer-icalc  Data  Driven  Multipro¬ 
cessor  (2|.  The  system  consists  of  m  identical  processing  modules  (each  on  a  silicon  wafer),  a 
host,  and  a  data  structure  module.  The  system  units  communicate  through  a  double  star  inter¬ 
connection  network,  with  the  host  and  the  data  structure  modules  as  the  centers  and  the  process¬ 
ing  modules  as  the  orbital  nodes  (Figure  1).  The  two  centers  are  connected  via  a  dedicated  chan¬ 
nel  which  allows  them  to  6hare  the  run  time  system  management  information.  It  should  be  men¬ 
tioned  that  the  choice  of  the  star  network  is  not  fixed  and  we  are  currently  investigating  other 
networks  as  possible  candidates. 

Although  the  star  network  is  potentially  prone  to  creating  a  bottleneck  at  the  center  station, 
we  have  incorporated  many  measures  in  the  architecture  in  order  to  alleviate  the  possible 
bottleneck.  For  example,  a  number  of  independent  dedicated  buffers  are  used  for  simultaneous 
transfer  of  blocks  between  the  host  and  several  processing  modules.  In  addition,  in  order  to 
reduce  the  overhead  of  the  host  in  the  memory  management  operations,  associative  processing  is 
utilized.  This  not  only  allows  fast  parallel  search  of  the  different  tables  which  are  used  for 
memory  management  operations,  it  also  provides  opportunities  for  parallelism  among  different 
associative  memories. 

It  should  be  mentioned  that  the  system  operations  are  based  on  the  dynamic  data  flow  prin¬ 
ciples  by  allowing  simultaneous  execution  of  different  activations  of  the  same  block.  However, 
within  a  block,  the  operations  are  based  on  the  static  data  flow  model.  In  other  words,  coloring 
(labeling)  is  used  at  the  program  block  level  instead  of  the  data  token  level. 

The  data  structure  module  holds  the  data  structures  and  partially  performs  or  initiates  the 
data  structure  operations,  using  the  processing  modules.  An  input  data  structure  is  duplicated  for 
each  function,  thus  eliminating  the  side-effects  among  the  functions.  However,  upon  the  comple¬ 
tion  of  a  function,  the  input  and  intermediate  data  structures  are  destroyed  and  only  a  pointer  to 
the  output  data  structure  is  returned.  Within  a  function,  a  scheme  similar  to  the  I-structure  pro¬ 
posed  by  Arvind  [3]  is  utilized.  Provisions  are  made  to  utilize  processing  modules  to  allow  vector 
operations  on  arrays.  In  addition,  the  interleaved  organization  of  the  data  structure  memory 
allows  simultaneous  access  to  many  elements. 

3.  The  Proponed  Processing  Module 

An  active  program  block  assigned  to  a  processing  module  is  executed  in  data  flow  fashion. 
Therefore,  there  are  two  levels  of  parallelism  during  the  execution.  First,  there  is  a  large-grained 
parallelism  due  to  the  concurrent  execution  of  the  program  blocks.  Second,  there  is  a  fine-grained 
parallelism  due  to  the  data  flow  operations  within  a  processing  module.  In  order  to  reduce  the 
network  delay  of  the  fine-grained  data  flow  parallelism,  the  processing  power  is  distributed  among 
the  memory  or  instruction  cells.  Therefore,  as  much  of  the  processing  as  possible  is  carried  out  in 
the  instruction  cell  itself  and  only  the  result  values  are  routed  to  the  network.  However,  because 
of  the  economic  considerations  and  according  to  the  RISC  (Reduced  Instruction  Set  Computers) 
philosophy,  only  the  simple  but  most  often  executed  instructions  are  executed  by  the  instruction 
cells  and  the  complex  operations  are  sent  to  the  coprocessors. 

Figure  2  depicts  the  general  organization  of  a  processing  module.  The  instructions  of  a  data 
flow  program  block  are  distributed  among  the  Elementary  processing  units  (E- units).  The  simple 


operations  such  as  fixed  point  addition  are  directly  executed  by  the  E-unit,  resulting  in  a  data 
token  which  is  sent  to  the  succeeding  instruction  via  the  sub-net.  The  E-units  are  also  responsible 
for  matching  the  input  data  tokens.  For  more  complex  operations,  such  as  floating  point  division, 
the  E-unit  forms  an  operation  token  and  routes  it  to  the  corresponding  Functional  unit  (F-unit) 
through  the  sub-net.  The  data  tokens  generated  by  the  F-units  are  routed  to  the  destination 
instructions. 

The  tub-net  is  an  arbitration  network  providing  the  intra-module  communication  among 
the  elements  of  a  processing  module.  It  also  connects  a  processing  module  to  the  host  and  the 
data  structure  modules. 

The  Active/ Inactive  Detector  (AID)  unit  keeps  track  of  the  status  of  a  program  block  as  it 
is  executed  by  the  processing  module.  When  a  processing  module  becomes  idle  due  to  a  procedure 
call,  the  status  is  reported  to  the  host  and  the  block’s  image  is  saved  in  the  host.  This  allows  the 
processing  module  to  be  assigned  to  another  enabled  block. 

4.  VLSI  Complexity 

The  proposed  system  is  suitable  for  VLSI  implementation  because  it  has  a  regular  and  cellu¬ 
lar  organization;  i.e.  duplication  of  processing  modules  at  system  level  and  duplication  of  E-units 
and  F-units  at  a  lower  level.  Furthermore,  the  VLSI  I/O  pin  limitation  problem  is  resolved 
through  the  use  of  wafer-scale  integration  for  implementation  of  the  processing  modules.  The 
intra-block  communication  takes  advantage  of  the  on-chip  (on-wafer)  interconnections,  and  the 
inter-block  connections  can  have  large  bandwidths  due  to  the  increased  size  of  the  wafer. 

In  addition  to  utilizing  some  standard  units,  we  have  proposed  a  number  of  special-purpose 
VLSI  components,  such  as  a  systolic  multipler  [4]  (as  an  example  of  an  F-unit)  and  an  associative 
memory  module  [5|  (for  use  in  the  memory  management  operations). 

The  basic  component  of  each  E-unlt  is  the  elementary  processor  which  performs  simple 
logic  and  arithmetic  operations.  In  order  to  accommodate  such  capabilities  into  an  E-unit,  we 
will  take  advantage  of  the  OM2  Data  Path  ALU  |6).  Each  ALU  bit  occupies  an  area  of 
.  1mm  X. 6mm.  Considering  the  area  needed  for  superbuffer  drivers  for  the  ALU  control  lines,  the 
storage  space  for  the  controller,  and  the  I/O  ports,  the  total  E-unit  area  for  a  32-bit  ALU  is 
estimated  at  4mm  X  4mm. 

The  sub-net  consists  of  (log2n)-2  stages  of  standard  2-by-l  arbitration  switches  followed  by 
two  stages  of  2-by-2  square  switches.  The  area  of  an  arbitration  switch,  routing  m  bits  in  paral¬ 
lel,  is  about  30mX  x  100X.  Let  n  be  the  total  number  of  the  E-units,  F-units,  and  the  two  input 
ports  on  a  processing  module.  The  length  of  the  first  stage  of  the  network  will  then  be 
2000X(n/2).  The  network  requires  (iog2n)-2  stages  of  arbitration  and  2  stages  of  square  switches, 
yielding  a  network  width  of  200X(log2n).  With  n=32,  the  network  dimensions  are  32000XX 
1000X.  The  network  delay  for  a  token  of  64  bits  is  10(log2n)-2)+90  ns  if  there  is  no  conflict  with 
the  passage  of  the  other  tokens.  With  n=32,  this  delay  is  120  ns. 

The  result  of  this  analysis  reveals  that  our  proposal  for  implementation  of  the  processing 
modules  on  silicon  wafers  is  well  within  the  limits  of  current  technology.  For  example,  with  a 
conservative  X  parameter  (2.er  (2.5pm),  a  system  of  32  E-units  can  be  easily  implemented  on  a 
wafer.  This  size  is  sufficient  enough  to  handle  the  majority  of  loops  and  small  procedures.  Natur¬ 
ally,  as  the  X  size  is  shrunk  due  to  the  advances  in  technology,  the  number  of  E-units  on  a  wafer 
can  be  increased.  For  example,  with  X=.8  pm,  a  64-E-unit  processing  module  will  be  reasonably 
practical. 

6.  Conclusion 

This  paper  has  introduced  a  new  multiprocessor  system  based  on  the  data  and  block  driven 
computation  principles.  The  system  has  a  cellular  architecture  suitable  for  VLSI  implementation. 
The  wafer-scale  integration  technology  is  used  to  speed  up  the  intra-block  communication,  reduce 
the  cost,  and  improve  the  I/O  pin  limitation  problems.  The  system  has  been  simulated  both 
based  on  a  probabilistic  model  and  an  emulation  model.  The  simulation  results  cannot  be 


presented  due  to  space  limitations.  The  performance  improves  by  increasing  the  number  of  E- 
units  and  processing  modules,  but  saturates  at  some  point  due  to  I/O  delays  and  single  program 
environment  simulation. 
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Introduction 

Recent  advances  in  custom  VLSI  chip  fabrication  technology  have  made  it  economically  feasible  to 
implement  parts  of  a  system  in  hardware.  A  number  of  issues  have  to  be  addressed  if  this  approach 
is  used.  One  of  them  is  the  systematic  derivation  of  a  processing  architecture  from  specification  of 
subsystem.  The  paper  addresses  this  issue.  The  basic  approach  used  is  the  Transformational  or 
Mapping  approach.  Our  version  of  the  approach  differs  from  others  [1-3]  in  several  respects.  After 
producing  an  Algorithm  Model  from  Recurrence  Relations  we  derive  a  Processing  System  Model 
algorithmically.  Interpolation  and  Integration  have  been  used  as  examples. 

Algorithm  Model 

We  propose  a  new  graph  model  called  Compute-Store  Graph  Model  to  describe  computations  being 
studied  in  this  paper.  Our  formalization  of  an  algorithm  is  as  follows: 

Definition  1.  (Compute-Store  Graph) 

A  Compute-Store  graph  is  a  labeled  DAGCQirected  Acyclic  Graph)  and  is  characterized  by  a 
3-tuple  (C,S,E)  where 

(1)  C  is  finite  set  of  Computing  nodes.  Each  computing  node  is  assigned  a  unique  co-ordinate  in 
a  2-D  Euclidean  space.  (Tranformation  1) 

(2)  S  is  a  finite  set  of  Storage  nodes.  With  each  storage  nodes  associated  are  two  attributes  (D,d) 
where  D  is  the  Data  type  and  d  is  the  delay.  Each  storage  node  is  assigned  a  unique  label  from 
the  set  N  of  natural  numbers.  (Transformation  2) 

(3)  E  is  a  finite  set  of  directed  Edges.  Each  incoming  and  outgoing  edges  are  labeled  uniquely 
from  the  set  N. . 

(4)  Vwe  C,  3u,vg  S  :  uwtE  and  wveE 

(5)  Vu,v  €  C  and  u^v,  uv^  E  and  vu/£  E 

(6) VseS,  there  is  at  most  one  ugC  :  su£E. 

Let  ueC ,  A  =  {w€S:wufcE}  and  R  =  {weS:uweE}. 

Definition  2-  (Computable) 

u  is  said  to  be  Computable  if  arguments  are  available  in  elements(storage  nodes)  of  A  and 
elements(storage  nodes)  of  R  are  empty. 

Definition  2-  (one  clock  tick) 

Let  tr  be  the  time  taken  to  receive  arguments  from  the  elements  of  A.  Let  tc  be  the  time  taken  to 
compute  the  function  by  u.  Let  tc  be  the  time  taken  to  store  the  results  in  the  elements  of  R.  Then  t 
is  said  to  be  one  clock  tick  if  t=tr+tc+ts. 

Definition  4-  (Rules  of  Computing) 

(1)  u  is  computable. 

(2)  u  receives  arguments  from  elements(storage  nodes)  of  A,  computes  the  function,  and  stores 
the  result  in  elements(storage  nodes)  of  R,  in  one  clock  tick. 

Model  of  Processing  System 

The  basic  objects  of  the  model  are:  Processing  Elements(PE)  and  Communication  Links(CL).  PE 
performs  the  ncessary  functional  transformation  using  the  data  communicated  to  it  via  CL  through 
input  ports  and  data  in  its  storage  registers.  It  performs  the  functional  transformation  after  the  elapse 
of  specified  units  of  time  contained  in  its  status  register.  The  delay  in  the  activation  of  functional 
transformation  ensures  that  all  the  required  data  are  available.  After  the  transformation  the  resulting 
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data  are  communicated  to  other  PE  via  CL  through  output  ports.  CL  provides  the  facility  for  inter-PE 
communication.  Each  CL  starts  at  an  output  port  and  terminates  at  an  input  port.  It  communicates  data 
of  specified  type. 

In  terms  of  these  basic  objects  a  Processing  System(PS)  is  defined  as  a  2-tuple  (PE, CL),  where  PE 
is  a  finite  set  of  Processing  Elements  and  CL  is  a  finite  set  of  Communication  Links.  PE  is  defined  as 
a  5-tuple  (FT, IP, OP, ST, SR),  where  FT  is  a  finite  set  of  functional  Transform  Units,  IP  is  a  finite  set 
of  Input  Ports,  OP  is  a  finite  set  of  Output  Ports,  ST  is  a  finite  set  of  Status  Registers  and  SR  is  a 
finite  set  of  Storage  Registers.  CL  is  defined  as  a  3-tuple  (OP,EP,EV),  where  OP  is  a  finite  set  of 
Output  Ports,  IP  is  a  finite  set  of  Input  Ports  and  EV  is  a  finite  set  of  Data  Types.  Also 
IP=IPI  U  EPE  where  IPI  is  a  finite  subset  of  Input  Ports  at  the  termination  of  CL  and  IPE  is  a  finite 
subset  of  Input  ports  from  the  Environment  to  PS.  Similarly  OP=OPI  U  OPE. 

Processing  Architectures  for  Interpolation  and  Integration 

Recurrence  relation  for  Neville’s  Interpolation  is  : 

Qi,0  *  f(*i>  ;  1  =  0)1 . ’n 

Qij  =  [  (x-Xj)  *  Qj_  i  j_  i  -  (x-xi.j)*Qij_1  ]  /  (xj.j  -  xj) ;  i  =  l,2,...,n  and  j  =  1,2, . ,i 

The  Compute-Store  graph  for  this  relation  and  the  corresponding  Processing  Architecture  are  shown 
in  Figs.  1  and  3. 

Recurrence  relations  for  Romberg  Integration  are  : 

Ru  =  0.5*(b-a)*[f(a)+f(b)] 

Rk'i  =  0.5*[  Rk_u  +  hk.!  *  SUM  {  f(a+(i-0.5)*hk.1)  }  from  i  =  1  to  2**(k-2)]  ;  k  =  2,3,.., n 
hk  =  b-a/  2**(k-l) 

Ri,j  =  [  Ri-l,j-l  *  (4**(j-l))  *  Rij-i  1  /  (  1  -  (4**(j-D) )  ;  i  =  2,3,...,n  and  j  =  2,3,. ..,i 
The  Compute-Store  graph  and  Processing  Architecture  for  this  relation  are  shown  in  Figs.  2  and  4. 

Mapping  Algorithm 


vm 


The  mapping  process  consists  of  applying  the  following  transformations  between  the  basic  objects  of 
Algorithm  Model  and  Processing  System  Model. 

Transformation  1:  Define  T1  :  C - >  E§  s.t.  Tl(Cy)  =  (i,j) 

Transformation  2:  Define  T2  :  S - >  N  s.t.  T2(sp  =  i  where  i  e  N 

Transformation  3:  Define  PE  =  T3.C  where  T3  is  an  nXn  integer  matrix.  One  feasible  way  of 

choosing  T3  is  :  T3  =  I 

Transformation  4:  Define  T4  :  (S,E) - >  (SR, ST).  Algorithm  A  constructs  T4. 

Transformation  5:  Define  T5  :  (C,S,E)  — >  CL.  Algorithm  B  constructs  T5. 

Transformation  6:  Define  T6  :  (S,E) - >  (IP, OP).  Algorithm  C  constructs  T6. 

Algorithms  A,B  and  C  have  been  described  in  [4]. 

Conclusion 

We  have  shown  that  by  representing  a  Recurrence  Relation  as  a  Compute-Store  Graph  it  is  possible 
to  algorithmically  obtain  a  feasible  Processing  Architecture  in  the  form  of  Processor-Commlink 
Model.  This  architecure  is  solely  based  on  the  characteristics  of  the  Recurrence  Relation.  No  external 
constraints  are  imposed  on  it.  Our  Mapping  Algorithm  produces  a  speed-up  of  O(n)  for  the  example 
architectures.  Several  related  issues  are  being  investigated.  One  of  them  is  obtaining  an  optimal  from 
the  feasible  architecture. 
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It  has  now  been  ten  years  since  the  “data  pull”  nature  of  lazy  applicative  programming  was 
proposed  as  a  solution  to  the  process  fragmentation  problem  of  parallel  processing  [3].  Since  that 
time  there  has  been  considerable  development  of  hardware,  both  pipelined/vector  machines  and, 
more  recently,  cube-  and  switch-connected  engines  composed  of  tens  and  hundreds  of  processors. 
The  style  of  programming  for  the  former  machines  has  not  improved  during  that  time,  and  we  yet 
have  no  languages  suitable  for  taking  appropriate  advantage  of  the  latter. 

At  the  same  time,  the  early  development  of  machines  derived  from  applicative  languages  has 
proceeded  from  combinator-based  hardware  to  more  elaborate  multiprocessor  architectures.  This 
paper  proposes  algorithms  derived  using  applicative  programming,  suitable  for  multiprocessors 
switch-connected  to  a  banked  (self-managing)  heap  memory.  Although  shared-memory  access 
to  A/  locations  requires  time  at  least  logarithmic  in  A/,  we  are  accustomed  to  constant-time  access 
(for  small  M — or  cache),  which  is  finally  disappearing  in  tree-  and  switch-connected  multiprocessor 
proposals.  Moreover,  there  are  several  solutions,  some  in  hardware  [15],  for  distributing  management 
of  a  shared  heap. 

The  Daisy  Project  at  Indiana  has  been  exploring  parallelism  through  implementation  of  and 
practice  in  an  applicative  (or  functional)  programming  language.  One  facet  of  this  effort  is  extending 
this  style,  well  grounded  as  it  is  in  formal  semantics,  back  toward  hardware  in  order  to  discover 
the  essence  of  a  rich  run-time  environment,  suitable  to  achieve  the  efficiency  apparent  in  the  high- 
level  program.  Aiming  at  a  well  known  problem  area,  Matrix  Algebra,  this  overview  demonstrates 
the  power  of  applicative  programming  in  impressing  a  new,  parallel  perspective  on  a  well  studied 
problem. 

Premises: 

Applicative  (functional)  programming  is  a  programming  style  [3,  11]  in  which  the  only  control 
structure  is  the  application  of  function  to  argument  (-list).  There  is  no  rebinding  of  identifiers 
(assignment  statement),  except  through  binding  of  parameters  to  arguments,  and  no  sequential 
flow  except  that  inferred  from  primitive  functions  (such  as  addition  where  the  values  of  all  terms 
are  necessary  before  the  computation  of  their  sum.)  Functional  operations,  like  composition  and 
mapping,  should  allow  for  one  function  application  to  return  several  results,  just  as  it  may  take 
several  arguments  [4]. 

A  significant  feature  of  this  style  is  that  there  is  no  inherent  sequentiality  of  execution.  Relaxed 
(lazy)  evaluation  order  means  that  arbitrary  parallel  evaluation  is  safe.  This  fact  makes  Daisy  most 
suitable  both  for  describing  parallel  algorithms  and  for  describing  parallel  hardware  [7]. 

Once  we  have  the  ability  to  identify  parallelism  within  a  large  function  (program),  there  remains 
the  problem  of  which  processes  to  run  with  only  finitely  many  processors.  While  the  problem  is 
unsolved  in  general,  we  observe  that  it  is  better  to  dispatch  processes  as  high  as  possible  in  the 
structure  of  the  program.  This  “height  heuristic”  is  an  effort  to  create  long-running  processes  in  order 
that  the  overhead  of  process  dispatch  and  recovery  be  amortized  over  as  long  a  parallel  computation 
as  possible.  If  processes  are  only  dispatched  from  points  deep  in  the  structure  of  a  program  (e.g.  only 
in  FORTRAN’S  arithmetic  expressions)  then  they  will  be  too  short-lived  to  justify  their  creation. 
Therefore,  it  is  important  to  lift  parallel-processing  notation  to  as  high  a  level  as  possible,  providing 
it  throughout  the  programming  language,  if  possible. 


Application  to  Matrices: 

If  applicative  programming  is  a  better  style  for  driving  parallel  architectures  than  conven¬ 
tional/iterative  languages  (like  FORTRAN),  then  it  ought  to  be  able  to  offer  some  new  perspective— 
or  even  an  improvement — on  problems  for  which  those  languages  are  thought  to  be  suited.  As  a  test 
of  this  premise,  we  consider  important  problems  from  Matrix  Algebra,  where  pipelined/vector  pro* 
cessors  ought  to  be  optimal. 

The  following  is  a  thumbnail  sketch  of  an  applicative  approach  to  matrix  algebra.  It  b  remarkable 
because  it  demonstrates  that  the  selection  of  a  tree-like  data  structure,  directly  suggested  by  the 
recursive  paradigm  and  contrary  to  most  memory  architecture,  leads  to  new  parallelism,  a  new 
strategy  for  processor  allocation,  and  an  efficiently  stable  algorithm. 

Representation  b  a  most  important  issue.  A  conventional  structure  in  the  highly  recursive  style 
of  functional  languages  b  the  tree.  Thus,  we  are  drawn  to  finite  trees  for  representing  finite  structures: 
specifically,  an  n-dimensional  array  will  be  represented  by  a  2"-ary  tree,  as  in  Figure  1.  The  exact 
size  of  an  array  b  implied  by  data  independent  of  this  structure,  because  a  scalar  b  to  be  interpreted 
as  such  an  array  of  any  size.  That  is,  the  scalar,  3,  may  be  interpreted  as  the  (2-dimensional)  matrix 
of  arbitrary  size  that  b  three  times  the  identity  matrix;  the  identity  matrix  is  1  and  0  is  the  zero 
matrix.  A  matrix  b  either  such  a  scalar  or  it  is  four  quadrants,  each  a  (sub)  matrix. 

Implicit  in  thb  definition  b  that  sparse  matrices  are  represented  sparsely  and,  therefore,  that 
the  space  necessary  to  fill  out  thb  Strassen-like  decomposition  [10]  to  a  large  matrix  of  size  2m  x  2m 
b  negligible.  The  preferred  memory  architecture,  however,  b  a  heap. 

Figure  2  shows  the  process  decomposition  for  conventional  operations  like  matrix  addition  and 
multiplication.  It  exhibits  two  facts:  first  that  task  decomposition  occurs  at  the  root,  into  four 
or  sixteen  large  subprocesses,  which  will  run  for  quite  a  while  amortizing  the  expense  of  their  dis¬ 
patch/recovery;  quite  an  improvement  over  parallelbm  only  at  the  leaves  of  an  expression  tree! 
Second  b  behavior  on  uncovering  sparseness;  sparse  behavior  is  effected  when  the  algorithm  encoun¬ 
ters  a  scalar  0  instead  of  a  pointer  to  a  large  matrix.  When  an  argument  to  addition  or  multiplication 
includes  a  0  (or  1)  quadrant,  then  25%  of  the  effort  b  annihilated  by  borrowing  a  reference  to  pieces 
of  the  other  operand  [13,  14]. 

More  important,  however,  is  matrix  inversion,  or  solving  bilinear  forms.  Figure  3  illustrates 
the  decomposition  for  a  Pivot  Step  at  a  known  position  pivot  position.  The  solution  b  to  return 
the  pivoted  matrix,  and  the  pivot  row  and  column,  each  represented  as  a  binary  tree.  The  problem 
decomposes  four  ways  into  quadrants  identified  as  that  containing  the  (non-zero)  pivot  element  (pie), 
those  coordinating  horizontally  (row)  and  vertically  {col),  and  that  diagonal  (off).  The  algorithm  b 
interesting  [16],  admitting  massive  parallelism  in  the  transformations  of  row,  col,  and  off. 

The  possibility  of  performing  full  row/column  pivoting  is  implicit  with  essentially  no  increase 
in  search  cost.  Suppose  that  before  beginning  the  pivoting  the  tree  structure  had  been  recursively 
traversed  and  annotated  at  each  nonscalar  subtree  with  the  largest  contained  (absolute)  value  and 
with  two  bits  indicating  in  which  quadrant  it  lay.  Sufficient  information  exbts  at  the  root  of  the  tree 
to  initiate  the  first  pivot  step  without  further  search.  Moreover,  those  maxima  may  be  maintained 
by  the  Pivot  Step  transformation,  because  the  information  b  local  to  values/subtrees  being  changed, 
and  because  that  maxima  information  b  static  off  subtrees  unaltered  (because  they  coordinated  on 
zeroes  in  the  pivot  row  or  column)  by  the  transformation.  Thus,  thb  sufficient  information  will  be 
maintained  at  the  root  of  the  tree  to  initiate  each  successive  Pivot  Steps  without  further  search. 
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Figure  2.  Simple  decomposition  or  sum. 
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element;  southwest  quadrant  indicated. 


Session  15C:  Applicative  Language  and 

Data  Fiow  Techniques 


Kim  Gostelow 

GE  Research  and  Development  Center 


IMPLEMENTING  LOGICAL  VARIABLES 
ON  A  GRAPH  REDUCTION  ARCHITECTURE1 


Gary  Lindstrom 

Department  of  Computer  Science 
University  of  Utah 
Salt  Lake  City,  Utah  84112 

1.  Functional  and  Logic  Programming 

Logical  variables  offer  a  semantic  meeting  ground  between  functional  and  logic 
programming  languages  [7].  From  the  perspective  of  functional  programming,  logical 
variables  present  a  radically  different  basis  for  information  flow.  In  particular, 

*  logical  variables  are  not  directly  bound  by  uniquely  determined  expressions,  in 
contrast  to  the  customary  dataflow  orientation  of  functional  languages; 

*  rather,  they  denote  values  determined  by  the  intersection  of  successively 
applied  constraints. 

In  a  functional  programming  language  with  logical  variables,  this  difference  can  permit 
novel  effects  otherwise  quite  difficult  to  obtain.  These  include: 

*  an  action  at  a  distance  effect,  whereby  certain  kinds  of  "side-effects"  can  be 
achieved  within  an  applicative  setting  (the  Prolog  "difference  list"  trick  for 
constant  time  list  concatenation  is  a  familiar  example); 

*  a  stronger  object  orientation,  under  which  logical  variables  play  the  role  of 
mutable  shared  entities,  and 

*  the  use  of  this  object  orientation  as  a  basis  for  isotropic  process 
communication  and  synchronization,  as  in  Concurrent  Prolog  [8]. 

2.  Reduction  Architectures 

Graph  reduction  offers  many  advantages  as  a  strategy  for  executing  applicative 
programs.  These  include: 

*  very  flexible  patterns  of  concurrency  and  load  distribution  on  parallel 
architectures; 

*  clean  support  for  infinite  and  cyclic  data  structures  through  demand 
evaluation  [1]; 
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*  good  program  and  data  locality  effects; 

*  the  support  of  recursion  and  iteration  (via  tail  recursion)  without  the  need  for 
explicit  token  objects  or  "color"  tags  and  associative  memories  for  their 
assembly  into  operator  "firing  sets",  and 

*  the  ability  to  incorporate  a  form  of  data-drive  as  a  submode. 

The  principal  prices  to  be  paid  for  these  advantages  are  potentially  costly  storage 
management,  and  the  added  overhead  of  demand  propagation.  A  good  survey  of 
reduction  architectures  may  be  found  in  [9]. 

3.  FGL+LV  on  Rediflow 

We  have  developed  a  strategy  for  implementing  FGL+LV  [5],  a  functional  language  with 
logical  variables,  on  the  Rediflow  multiprocessing  graph  reduction  architecture  [2],  The 
aspects  of  logical  variables  receiving  special  consideration  include: 

a.  parallel  unification,  especially  proper  treatment  of  indeterminate  behavior,  e  g. 
mutual  exclusion  on  variable  binding  [3]; 

b.  variable  binding  through  emulated  graph  node  merging; 

c.  exploitation  of  two  levels  of  demand:  assertive  (during  unification),  and 
non-assertive  (ordinary  "read-only"  usage),  and 

d.  avoidance  of  meaningless  cyclic  variable  bindings. 

The  existing  base  language  implementation  is  smoothly  extended,  with  a  word  size 
increase  of  only  one  bit  (needed  to  implement  two  levels  of  demand).  Lazy  evaluation  in 
the  base  language  is  retained,  except  that  actual  parameters  are  now  made  strict  to  one 
level  of  evaluation  This  requirement,  overlooked  in  [5],  is  argued  to  be  semantically  and 
operationally  inescapable  in  a  functional  language  with  logical  variables.  It  also  provides 
insight  into  the  vexing  problem  of  how  to  apply  the  occur  check  in  a  language  with 
infinite  data  objects  A  novel  technique  for  merging  cyclic  lists  is  used  to  implement 
logical  variable  binding  in  a  distributed  manner  without  locking  or  busy  waiting. 

4.  Extensions 

Our  work  thus  far  has  focused  on  the  adoption  of  logical  variables  in  a  deterministic 
setting,  i.e.  without  support  for  backtracking  or  OR-parallel  search  [4],  This  single 
extension  brings  into  the  realm  of  functional  programming  such  elegant  techniques  as  the 
Milner  algorithm  for  polymorphic  type  checking  (6]  And,  while  the  absence  of  OR- 
parallelism  may  be  lamentable,  in  partial  compensation  we  retain  functional 
programming's  execution  directional /ty,  which  offers  a  clean  solution  to  the  AND-parallel 
control  problem  which  plagues  pure  logic  programming. 

Continuation  work  is  underway  aimed  at  augmenting  this  implementation  approach  to 
embrace  OR-parallelism 
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For  future  symbolic  multi-processing,  networks  of  several  hundred  processors 
are  anticipated,  each  consisting  of  10K-20K  devices.  At  1984  commercial  densities 
this  equates  to  4  to  8  devices  per  chip.  According  to  anticipated  1988  densities  [1], 
this  can  be  expected  to  increase  to  as  many  as  128  per  chip.  We  are  quickly 
approaching  a  time  when  the  technology  will  support  large  scale  symbolic  parallelism, 
but  only  if  we  can  find  ways  to  efficiently  implement  it.  To  help  achieve  this  goal, 
there  is  a  need  to  investigate  moderately  sized,  computationally  versatile  processors 
designed  for  efficient  inter-processor  communication  within  a  multi-processor 
organization. 

There  exists  a  general  trend  at  present  toward  language  directed  architectures. 
RISCs,  CISCs,  and  direct  execution  machines  have  all  attacked  the  uni-processor  design 
problem  with  various  language  directed  goals. 

RISC  architectures  are  designed  with  the  goal  of  CPU  control  simplification. 
This  simplification  can  be  made  based  upon  instruction  frequencies  in  compiled 
languages.  Here,  more  frequent  instructions  are  implemented  with  fewer  clock  cycles. 
A  simpler  control  section  also  allows  for  additional  on-chip  register  space.  CISC 
architectures  take  a  contrary  approach,  that  of  CPU  control  enhancement.  This 
enhancement  is  less  based  upon  frequencies  of  common  instructions  than  it  is  upon 
building  new  ones.  Thes  new  instructions  are  intended  to  accomplish  in  one  main 
memory  access  what  would  have  otherwise  required  several.  Both  RISC  and  CISC 
approaches  share  the  goal  of  providing  more  efficient  support  of  compiled  high  level 
languages.  Whether  one  does  so  better  than  another  is  a  subject  of  debate.  It  is  not 
the  intention  of  this  paper  to  take  a  stand  on  that  issue.  The  point  here  is  that 
both  approaches  share  a  design  philosophy  based  upon  some  aspect  of  abstract 
programming  language  semantics. 

Direct  execution  processors  discard  the  conventional  RISC  and  CISC  machine 
instruction  set  notion  for  the  direct  representation  of  a  high  level  language.  Once 
again,  the  idea  is  to  construct  hardware  specifically  suited  to  an  abstract 
programming  language.  High  performance  Lisp  machines  have  been  commercially  available 
for  some  time  now.  At  least  eight  commercial  development  efforts  for  machines 
directly  executing  Ada,  Prolog,  Forth,  Lisp  and  other  languages  are  under  way. 
Impressive  performance  measurements  have  been  reported  [3]. 

Functional  programming  architectures  represent  one  language  directed  approach 
to  multi-processing.  Functional  languages  are  a  subclass  of  declarative  programming 
languages,  of  which  Prolog  is  perhaps  the  best  known  member.  Declarative  languages 
provide  flexible  expression  of  general  purpose  algorithms,  while  implicitly  expressing 
parallelism.  The  interested  reader  may  find  a  useful  development  on  functional 
languages  in  [4],  Informative  surveys  of  architectures  influenced  by  this 
programming  paradigm  may  be  found  in  [6]  and  [8], 

Several  aspects  affect  the  design  of  a  multiprocessor  for  functional 
programming  languages.  These  include  the  language  itself,  the  execution  algorithm, 
and  the  hardware  organization. 

To  make  effective  use  of  current  and  anticipated  technology,  it  is  felt  that 
the  selected  machine  language  must  have  a  simple  basis,  be  expressive,  and  be 
extendable.  If  the  language  starts  out  simple,  then  less  hardware  area  will  be 
expended  to  implement  the  bare  essentials.  Furthermore,  the  language  will  not  be 
useful  in  a  general  purpose  sense  if  it  is  not  expressive  and  extendable.  Berkling's 
lambda-reduction  language  [2]  and  Turner's  combinators  [7]  have  been  considered  by  the 
authors  for  kernel  languages.  Lambda  reduction  languages  are  composed  of  lambda 


calculus  expressions  having  string  based  reduction  semantics.  Turner's  combinator 
language  is  similar,  except  that  it  has  graph  based  reduction  semantics,  and  makes  use 
of  special  purpose  functions  called  combinators.  Berkling's  lambda  reduction  language 
and  its  derivatives  have  the  advantage  of  being  more  readable,  at  the  cost  of  the 
extra  complexity  due  to  bound  variables.  Combinators,  on  the  other  hand,  sacrafice 
readability  for  a  simpler  implementation  requiring  no  bound  variables.  Lambda- 
reduction  languages  are  the  current  focus  of  the  authors'  work  because  of  their 
similarity  to  Lisp.  Combinators  could  play  an  important  future  role  if  they  promise 
superior  multiprocessing  performance  or  hardware  simplification  is  required.  It  is 
expected  that  once  more  is  understood  about  implementing  these  two  languages,  then 
others  such  as  functional  Lisp  dialects,  FFP,  and  Hope  may  be  considered. 

The  execution  algorithm  specifies  hardware  state  transitions  in  terms  of 
machine  language  syntax  and  semantics.  Internal  representation,  and  the 
interpretation  method  each  play  a  role  in  the  determination  of  an  execution  algorithm. 

Internal  representation  is  the  way  machine  language  is  encoded  to  represent 
part  of  the  hardware  state.  Virtually  all  computing  machines  use  either  a  string 
based  representation  or  one  that  is  graph  based.  Conventional  instruction  sets  use  a 
combination  of  the  two.  In  string  based  systems,  logically  adjacent  tokens  are 
physically  adjacent  in  machine  memory.  In  graph  based  systems,  logically  adjacent 
tokens  are  linked  by  indirect  references  using  physical  addresses,  so  need  not  be 
physically  adjacent.  String  based  program  representations  are  widely  recognized  to 
have  an  undesirable  difficulty  with  sequential  evaluators.  To  update  a  reference  in  a 
string  representation,  all  occurances  of  that  reference  within  the  string  must  be  re¬ 
written.  This  can  require  scanning  the  entire  program  string  —  clearly  an 
undesirable  action  in  a  sequential  evaluator.  Graph  representations  avoid  this 
problem  by  allowing  each  reference  to  point  to  the  same  value,  so  only  one  update  need 
be  performed  to  satisfy  all  references.  By  exploiting  this  property,  graph 
representations  eliminate  the  need  to  re-evaluate  sub-expressions,  and  are  more 
compact.  These  are  all  useful  attributes  for  sequential  evaluation.  However,  complex 
memory  management  requirements  (including  garbage  collection),  and  the  "fine-grained 
violation"  of  the  Church-Rosser  property  make  graph  based  representations  less 
desirable  in  a  multi-processor.  String  representations  have  a  predictble  information 
flow,  and  ability  to  take  advantage  of  simple  memory  management  schemes.  For  these 
reasons,  it  is  felt  that  string  or  at  least  predominantly  string  based  representations 
should  be  considered  for  multi-processor  implementations. 

Closely  related  to  internal  representation  is  the  issue  of  how  a  given 
representation  is  interpreted.  Interpretation  by  evaluation  implies  program  values 
are  combined  to  produce  a  result.  Interpretation  by  reduction  implies  that  the 
program  string  is  repetitively  re-written  to  obtain  a  result.  It  is  felt  that 
reduction  offers  a  better  alternative  for  a  multi-processor  because  state  and  control 
information  can  flow  as  a  unified  stream  between  processing  elements. 

Several  hardware  organizations  are  under  consideration  for  reduction  based 
multiprocessing.  A  simple  linear  organization,  for  example,  is  one  under 
investigation.  A  processing  element  of  this  array  uses  string  reduction.  It  accepts 
input  tokens,  expecting  a  left-to-right  traversal  of  the  program  expression's  internal 
representation.  At  any  moment,  it  is  responsible  for  the  reduction  of  a  single  sub¬ 
expression. 

Since  the  current  target  is  Berkling's  lambda  reduction  language,  the  program 
string  consists  of  a  sequence  of  atoms  and  constructors.  Constructors  combine 
expressions  to  form  larger  ones,  in  a  way  similar  to  what  CONS  does  for  lists  in  Lisp. 
Constructors  also  act  as  special  operators  in  the  languages.  The  two  most  important 
constructors  are  lambda  and  apply. 

If  a  lambda  constructor  is  encountered,  then  a  beta-reduction  based  on  the 
beta-conversion  operator  of  lambda  calculus  is  performed.  If  an  application 
constructor  (applicator)  is  recognized,  tokens  are  absorbed  until  either  another 
constructor  is  encountered,  or  the  end  of  the  reducible  expression  is  reached.  If 
another  constructor  is  encountered  after  the  applicator,  all  tokens  absorbed  to  that 


point  are  passed  to  the  output,  that  information  becomes  available  to  a  neighbor 
processing  element,  and  interpretation  proceeds  with  the  new  constructor.  If  no 
additional  constructors  are  found,  that  means  the  innermost  expression  has  been 
reached  and  a  function  application  is  performed.  Lambda  constructors  follow  an 
outermost  reduction  rule,  and  applicators  follow  an  innermost  rule.  The  innermost 
function  applications  are  recognized  because  the  processor  only  remembers  information 
since  the  last  applicator.  Only  outermost  lambda  reductions  are  performed  by  an 
element  because  sub-constructors  are  ignored.  For  a  single  traversal  or  the  program 
representation,  either  all  innermost  function  applications  or  all  outermost  lambda 
expressions  will  be  reduced. 

A  linear  array  of  such  processing  elements  yields  a  network  similar  to  the 
Newcastle  reduction  machine  [5],  It  differs  in  that  the  program  expression  flows  in 
one  direction,  helping  simplify  the  processing  element  and  inter-processor 
communications.  It  is  expected  that  a  processing  element  having  a  useful  set  of 
functional  primitives  will  require  10K  to  20K  devices  for  control  and  local  memory. 
As  a  result,  multi-processor  reduction  machines  based  upon  these  ideas  are  expected  to 
offer  a  viable  approach  for  future  symbolic  processing. 
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Abstract 

This  paper  argues  that  massive  parallelism  in  vector  computation  can  be  exploited 
effectively  utilizing  data  flow  principles  in  future  generation  computers.  The  key  is  to 
organize  the  data  flow  machine  program  graph  such  that  array  operations  can  be  fully 
pipelined.  The  applicative  nature  of  the  data  Sow  graph  model  allows  flexible  scheduling 
of  the  execution  of  enabled  instructions  in  the  pipeline  data  flow  programs.  Accordingly, 
program  transformation  can  be  performed  on  the  basis  of  both  the  global  and  local  data 
flow  analysis  to  generate  efficient  pipelined  data  flow  machine  code.  A  pipelined  code 
mapping  scheme  for  transforming  array  operations  in  high-level  language  programs  into 
pipelined  data  flow  machine  programs  is  described. 

1  Massive  Parallelism  in  Array  Computation 

A  major  driving  force  in  the  development  of  high-performance  computers  has  been  scientific 
computation.  The  kernels  of  such  computations  typically  are  expressed  in  linear  algebra 
with  all  data  structured  as  elements  of  arrays.  In  the  computation,  the  bulk  of  the  elements 
of  an  array  are  processed  in  a  regular  and  repetitive  pattern  through  the  different  phases  of 
execution.  For  example,  many  applications  take  the  form  of  computing  successive  states  of 
a  system  represented  by  physical  quantities  on  an  Euclidean  grid  in  two  or  three  dimensions, 
and  the  new  values  of  each  grid  point  may  be  computed  independently.  Thus,  the  degree 
of  concurrency  is  often  at  least  equal  the  number  of  the  grid  points  (for  a  100x100x100 
case,  the  parallelism  will  be  well  over  a  million!).  Therefore,  the  efficient  mapping  of  the 
massive  parallelism  of  array  computation  into  machine-level  code  structure  has  been  a 
major  consideration  in  the  design  of  high-performance  computer  architecture  as  well  as  its 
program  trams. arming  compilers. 

2  Fine-grain  Parallelism  and  Array  Computation 

2.1  Pipelining  of  Data  Flow  Programs 

Fine-grain  parallelism  exists  in  two  forms  in  a  data  flow  machine  level  program,  as  shown 
in  Figure  1,  which  consists  of  seven  actors  divided  into  four  stages.  In  Figure  1  (a),  actors 
1  and  2  are  enabled  by  the  presence  of  tokens  on  their  input  arcs,  and  thus  can  be  executed 
in  parallel.  Parallelism  also  exists  between  actors  3  and  4,  and  between  actors  5  and  6. 
In  static  data  flow  architecture,  we  can  arrange  the  machine  code  such  that  successive 
computations  can  follow  each  other  through  one  copy  of  the  code.  If  we  present  a  sequence 
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of  values  to  the  inputs  of  the  data  flow  graph,  these  values  can  flow  through  the  program 
in  a  pipelined  fashion.  In  the  configuration  of  Figure  1  (b),  two  sets  of  tokens  are  pipelined 
through  the  graph,  and  the  actors  in  stage  I  and  3  are  enabled  and  can  be  executed 
concurrently.  Thus,  both  forms  of  parallelism  are  fully  exploited  in  the  graph. 

2.2  Data  Flow  Language*  and  Array  Computations 

The  use  of  data  flow  languages  such  as  Val  [1]  encourages  an  applicative  style  of  program¬ 
ming  which  does  not  depend  on  the  von  Neumann  style  of  machine  program  execution. 
The  basic  operations  of  the  language,  including  operations  on  arrays,  are  simple  functions 
that  map  operands  to  results.  Data  dependencies,  even  those  involving  arrays,  should  be 
apparent.  Since  our  major  concern  is  how  to  utilize  the  regularity  of  array  operations 
in  the  source  program,  we  concentrate  on  two  array  creation  constructs  -  the  forall  and 
for-construct  expressions. 

The  forall  construct  allows  the  user  to  specify  the  construction  of  an  array  where 
similar  independent  computations  can  be  performed  on  all  elements  of  the  array.  Figure  2 
(a)  is  an  expression  which  defines  a  one-dimensional  array  X  from  an  input  array  A.  The 
for-construct  expression,  proposed  as  a  special  case  of  the  Val  iteration  construct,  is  used 
to  specify  construction  of  an  array  where  certain  forms  of  data  dependencies  exist  between 
its  elements.  Figure  2  (b)  is  a  for-construct  expression  which  constructs  an  array  X 
based  on  the  first-order  linear  recurrence,  using  array  A  and  B  as  parameter  arrays. 

The  two  constructs  provide  a  means  to  express  array  construction  operations  of  desired 
regularity  without  using  array  append  operations.  Expressions  based  on  these  constructs 
are  the  major  code  blocks  studied  in  this  paper. 
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3  Pipelining  of  Array  Operations 

One  objective  of  the  machine  code  mapping  scheme  for  static  data  flow  computers  is  to 
generate  code  which  can  be  executed  in  a  pipelined  fashion  with  high  throughput.  The 
pipeline  must  be  kept  busy  j'  computation  should  be  balanced  and  no  branch  in  the 
pipe  permitted  to  block  the  data  flow.  Furthermore,  computation  resources  should  be 
efficiently  utilized.  In  particular,  the  usage  of  storage  for  arrays  is  important  because  the 
user  program  usually  contains  vast  amounts  of  array  data  to  be  processed. 

In  a  data  flow  computation  model,  an  array  value  can  be  regarded  as  a  sequence  of 
element  values  carried  by  tokens  transmitted  on  a  single  data  flow  arc.  In  Figure  1  (c), 
the  tour  input  arcs  are  presented  with  four  input  arrays  A,  B,  C,  D;  all  are  spread  in  time. 
Thus,  the  graph  can  be  fully  pipelined. 

We  can  observe  that  each  actor  in  Figure  (b)  is  performing  a  vector  operation,  e.g.,  ac¬ 
tor  I  -  vector  addition,  actor  2  -  vector  subtraction,  etc.,  a  total  of  seven  vector  operations. 
However,  unlike  the  vector  operations  usually  supported  in  conventional  vec*or  processors, 
there  is  no  requirement  that  the  activities  of  one  such  vector  operation  be  continuously 
processed  by  one  or  a  group  of  dedicated  function  units  in  the  processor.  The  applicative 
nature  of  the  data  flow  graph  model  allows  flexible  scheduling  of  the  execution  of  enabled 
actors  in  the  pipeline.  In  fact,  an  ideal  data  flow  scheduler  (with  a  sufficiently  large  data 
flow  computer)  will  execute  each  actor  as  soon  as  its  input  data  become  available.  As  a 
result,  the  activities  of  the  seven  vector  operations  overlap  each  other,  performing  opera¬ 
tions  on  different  elements  of  different  arrays  concurrently.  Therefore,  massive  parallelism 
of  vector  operations  can  be  effectively  exploited  by  a  data  flow  computer  .n  a  fine-grain 
manner:  the  scheduling  of  the  physical  function  units  and  other  resources  for  sustaining 
such  vector  operations  are  totally  transparent  to  the  user. 
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Figure  3:  A  group  of  code  block* 


4  Program  Mapping  and  Optimization  Techniques 

We  have  developed  a  pipelined  code  mapping  scheme  which  concentrates  on  the  analysis 
and  handling  of  the  two  types  of  code  blocks  [6j.  It  is  essentially  a  two-step  process.  The 
first  step  consists  of  the  application  of  a  set  of  basic  mapping  rules  which  can  translate 
the  code  blocks  into  pipelined  data  flow  graphs.  In  this  step,  the  conceptual  model  of 
arrays  in  the  source  program  -  i.e.  the  input  and  output  arrays  as  seen  by  each  code 
block  -  remains  unchanged,  but  the  array  operations  are  translated  into  corresponding 
data  flow  actors  in  the  result  graph.  The  second  step  consists  of  the  application  of  a  set 
of  optimisation  procedures  which  can  remove  the  array  actors  from  the  result  graphs  of 
step  I  and  replace  them  with  ordinary  graph  actors.  Thus,  the  links  between  code  blocks 
become  ordinary  arcs  of  a  data  flow  graph.  The  result  graph  for  a  pair  of  producer  and 
consumer  code  blocks  may  be  executed  concurrently,  both  in  a  pipelined  fashion,  without 
involving  array  operations. 

5  Summary 

Based  on  the  pipelined  code  mapping  strategy,  the  performance  analysis  of  a  recently 
proposed  data  flow  supercomputer  architecture  consisting  of  256  processors  has  reported 
promising  expectation  in  a  number  of  benchmark  problems,  notably  the  global  weather 
simulation  model  [3],  the  three  dimensional  aerodynamic  simulation  problem  [4],  and  sev¬ 
eral  benchmarks  studied  in  [7],  It  remains  to  be  seen  how  broadly  applicable  are  such  code 
mapping  scheme  in  data  flow  supercomputing. 
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SUMMARY 


An  initial  implementation  of  an  interactive  Automated  Data  Flow 
Diagram  Verification  system  (ADFDV)  is  described.  ADFDV  is  a 
component  of  the  larger  sytem  called  System  Analyst's  Apprentice 
(SAA) .  SAA  is  being  developed  to  assist  an  analyst  in  different 
phases  of  Computer  Information  System  (CIS)  design  and  development. 
The  ADFDV  is  a  tool  for  mechanical  checking  of  Data  Flow  Diagrams  for 
correctness.  The  philosophy  behind  the  system  is  to  assist  the 
analyst  in  the  error  prone  job  of  the  verification  of  Data  Flow 
Diagrams  for  large  programming  projects. 

Structured  Analysis  and  Design  techniques  have  been  used  to  avoid 
errors  that  may  have  an  impact  on  the  quality  of  final  CIS  design. 
First,  in  early  1970s,  these  techniques  were  successfully  applied  to 
the  development  of  highly  structured  programs.  In  late  1970s  Edward 
Yordon  and  Larry  Constantine  first  described  a  structured  process- 
oriented  method  for  system  analysis  and  design.  This  method  is  based 
on  functional  system  decomposition  and  stresses  flow  and 
transformation  of  data  in  the  computer  system.  It  uses  diagramming 
techniques  for  system  design  specification.  Data  Flow  Diagrams  are 
the  focal  point  of  those  techniques.  They  are  used  as  a  formal 
descriptive  language  as  well  as  informal  means  of  communication 
between  system  developers  and  future  system  users.  They  are  used  to 
describe  the  system  on  very  high  logical  level  as  well  as  on  the  low 
physical  level. 

The  system  analysis  and  design  method  first  proposed  by  Yordon  and 
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Constantine  and  later  improved  by  others  is  well  known  under  the  name 
of  Data  Flow  Analysis  Method  (DFAM)  and  has  been  widely  accepted  by 
the  IS  professionals.  The  purpose  of  the  present  work  is  to  develop  a 
prototype  expert  system.  System  Analyst's  Apprentice  (SAA) ,  to  assist 
the  analysts  in  the  process  of  application  design  and  development 
through  DFAM  techniques.  The  gratest  advanatage  of  the  system  comes 
from  the  fact  that  it  will  free  the  analyst  from  a  great  amount  of 
tedious  and  low  level  work.  That  in  turn  will  give  him  the  oportunity 
to  direct  his  attantion  to  higher  level  problems. 

The  current  system  implementation  is  based  on  the  formalization  of  the 
primitives  for  the  interactive  design  of  DFDs  and  the  rules  for  DFD 
verification  using  knowledge  representation  language  MRS  as  a  tool. 
New  primitives  and  new  rules  can  be  added  to  the  system.  The  greatest 
advantage  of  ADFDV  can  be  realized  when  interactive  development  of  a 
Data  Flow  Diagram  is  done  with  the  verification  option  running  in  the 
background.  The  design  and  implementation  of  ADFDV  serves  as  a 
testbed  for  the  design  of  System  Analyst's  Apprentice. 

Keywords:  Expert  Systems,  Information  Systems  Analysis,  Information 
Systems  Design,  Data  Flow  Analysis,  Data  Flow  Diagram  Verif ication. 
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THE  APSA  PROJECT 

The  Applicative  Programming  System  Architecture  (APSA)  research  project  [2,  3,  4,  5]  is  developing 
new  ways  to  use  finely  grained  parallel  architectures  to  support  applicative  programming  languages.  The 
goal  is  a  parallel  machine  that  is  easily  programmable  with  a  LISP-like  language.  The  programmer  does 
not  need  to  control  the  parallelism  explicitly.  Instead,  the  system  provides  a  number  of  rich  types  of  data 
structure  with  parallel  operations  that  perform  complex  operations  on  the  data  structures.  Some  of  these 
data  structures  are  used  by  the  language  interpreter,  and  others  are  available  to  the  programmer.  Even 
a  programmer  who  does  not  directly  use  any  of  the  parallel  data  structures  will  benefit  from  the  system 
because  the  language  interpreter  itself  usee  parallel  data  structure  operations  to  run  faster.  This  approach 
ensures  that  programs  can  actually  make  use  of  the  architecture’s  power  and  it  greatly  eases  the  burden 
of  writing  parallel  programs. 

In  a  conventional  computer,  the  CPU  can  issue  two  “instructions9  to  the  memory:  fetch  and  store. 
Programmers  must  implement  all  their  data  structures  and  algorithms  using  just  these  two  operations, 
along  with  the  ability  to  do  address  arithmetic  in  the  processor.  APSA  consists  of  a  processor  connected 
to  a  heap  memory  which  executes  about  30  data  structure  instructions  in  addition  to  fetch  and  store.  Some 
of  these  heap  memory  instructions  can  do  work  that  requires  a  loop  if  only  fetch  and  store  are  available  — 
in  a  constant  number  of  cycles.  Several  examples  of  APSA  data  structure  operations  are  given  below. 

Most  approaches  to  multiprocessing  view  a  computation  as  a  given  set  of  instructions  that  must  be 
executed,  and  they  try  to  partition  the  instructions  into  subsets  that  can  be  executed  in  parallel  on  separate 
processors.  In  contrast,  APSA  uses  parallelism  to  reduce  the  number  of  instructions  that  must  be  executed, 
and  it  reduces  the  time  complexity  of  several  key  algorithms  by  O(n),  where  n  measures  the  size  of  the 
problem. 

APPLICATIVE  LANGUAGES  AND  PARALLELISM 

In  most  recent  research  on  parallel  computing,  the  underlying  language  is  imperative  (e.g.,  Fortran), 
and  the  underlying  architecture  comprises  conventional  processors,  memories,  and  switches  organized  as 
a  coarsely  grained  multiprocessor  or  vector  processor.  These  assumptions  limit  the  amount  of  parallelism 
that  can  be  obtained  because: 

•  Imperative  languages  place  unnecessary  constraints  on  the  order  and  concurrency  of  primitive  opera¬ 
tions. 

•  Systems  built  from  conventional  processors  and  memories  cannot  fully  exploit  the  fine  grain  parallelism 
that  VLSI  hardware  makes  feasible. 

APSA  produces  parallelism  through  an  applicative  language  running  on  an  architecture  with  an  extremely 
fine  grained  mixture  of  logic  and  storage.  This  serves  several  purposes,  discussed  below:  (I)  APSA  makes 
the  speed  of  an  applicative  program  comparable  to  the  speed  of  an  imperative  program  by  making  the 
primitive  applicative  operations  fast,  (2)  APSA  can  perform  independent  reductions  ic  parallel,  indicating 
that  it  may  be  able  to  exploit  the  high  level  parallelism  inherent  in  applicative  languages,  and  (3)  the 
APSA  hardware  uses  tiie  ability  of  VLSI  to  mix  logic  and  storage  in  regular  layouts. 

The  main  drawback  of  applicative  languages  is  that  they  are  usually  slow,  because  conventional  data 
structures  based  on  fetch  and  store  cannot  support  many  primitive  applicative  operations  efficiently.  For 
example,  a  FORTRAN  program  can  fetch  the  value  of  a  variable  simply  by  executing  a  fetch  on  a  conven¬ 
tional  machine,  but  fetching  the  value  of  a  variable  in  an  applicative  language  is  much  more  complex  and 
usually  requires  iteration.  This  phenomenon  creates  the  illusion  that  the  applicative  language  is  inherently 
slower  than  FORTRAN,  but  the  real  problem  is  just  a  mismatch  between  the  language  and  conventional 
architectures.  APSA  uses  parallel  data  structure  operations  to  make  the  primitives  for  an  applicative 
language  execute  in  a  small  constant  time. 
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As  it  runs,  an  applicative  program  generates  many  expressions  whose  evaluation  cannot  affect  the 
execution  of  other  expressions.  Several  ‘parallel  reduction  machines”  have  Seen  proposed  to  exploit  this 
inherent  parallelism.  In  many  cases  APSA  is  able  to  do  parallel  reductions,  and  further  research  in  this 
area  looks  promising. 

Since  it  performs  data  structure  algorithms  in  the  heap  memory  hardware,  APSA  combines  substantial 
processing  logic  with  every  memory  word.  This  organization  takes  advantage  of  the  properties  of  VLSI 
circuits:  a  large  heap  memory  consists  of  a  simple  regular  pattern  of  memory  word  components.  In 
addition,  the  heap  memory  could  use  wafer  scale  integration  because  of  the  simple  interconnection  of  its 
components.  A  preliminary  VLSI  design  of  the  heap  memory  is  in  progress. 

PARALLEL  DATA  STRUCTURE  ALGORITHMS 

Many  key  algorithms  for  implementing  applicative  languages  are  generally  considered  to  be  inher¬ 
ently  sequential.  For  example,  indexing  into  a  linked  list,  marking  accessible  words  in  a  garbage  collector, 
managing  continuations  and  maintaining  variable  environments  are  all  crucially  important  for  applica¬ 
tive  language  implementation  —  yet  these  are  implemented  sequentially  even  in  proposed  multiprocessors. 
APSA  implements  all  these  operations  quickly  by  providing  special  instructions  in  the  heap  memory  that 
use  hardware  parallelism  instead  of  software  iteration. 

The  implementation  of  a  combined  list/vector  data  structure  illustrates  how  APSA  works.  Vectors  are 
normally  implemented  by  storing  adjacent  vector  elements  in  storage  locations  with  consecutive  addresses. 
This  allows  the  address  generation  hardware  in  conventional  memories  to  access  an  arbitrary  vector  element 
directly.  However,  it  is  very  expensive  to  insert  or  delete  an  element  in  the  middle  of  the  vector,  because 
that  requires  recopying  much  of  the  vector  in  order  to  make  space  for  the  insertion  or  to  fill  up  the  hole 
left  by  the  deletion.  The  properties  of  linked  lists  are  just  the  opposite:  ‘indexing”  into  the  middle  of 
a  linked  list  is  slow,  because  it  requires  a  loop  that  follows  pointers  from  one  element  to  the  next,  but 
insertion  and  deletion  are  fast  because  they  require  only  a  few  pointer  manipulations.  Programmers  know 
how  to  use  vectors  and  linked  lists  effectively,  but  they  also  “know”  that  a  data  structure  cannot  combine 
the  advantages  of  both. 

The  APSA  heap  memory  can  usually  store  a  linear  data  structure  compactly,  similarly  to  the  con¬ 
ventional  vector  representation.  It  can  index  any  element  directly,  and  it  can  also  perform  an  arbitrary 
insertion  or  deletion  in  constant  time  by  moving  as  many  data  elements  as  necessary  out  of  the  way  in 
parallel.  This  provides  the  programmer  with  an  extremely  powerful  data  structure.  Instead  of  thinking 
about  how  to  schedule  multiprocessors,  the  programmer  thinks  about  how  to  use  the  flexibility  of  the  data 
structure  manipulation  made  possible  by  APSA.  Lists  and  vectors  are  just  special  cases  of  the  APSA  data 
structure. 

Parallel  garbage  collection  shows  another  way  that  APSA  speeds  up  algorithms,  and  it  also  illustrates 
how  two  levels  of  parallelism  can  be  combined  in  a  single  algorithm.  Garbage  collection  is  one  of  the 
most  crucial  algorithms  required  to  implement  applicative  languages,  and  its  performance  may  determine 
whether  an  applicative  programming  system  is  usable.  Other  researchers  have  proposed  several  ways  to 
construct  parallel  garbage  collectors.  Those  techniques  involve  two  processors  running  the  interpreter  and 
the  garbage  collector  simultaneously,  but  the  processor  doing  the  garbage  collection  is  strictly  sequential. 
APSA  also  runs  the  interpreter  and  garbage  collector  in  parallel.  In  addition  to  this  high  level  parallelism, 
the  APSA  heap  memory  is  able  to  mark  many  accessible  words  in  parallel,  achieving  low  level  parallelism 
within  the  garbage  collector.  The  resulting  garbage  collector  is  faster  than  it  would  be  if  either  form  of 
parallelism  were  used  without  the  other. 

APSA’S  HARDWARE  AUGMENTS  CONVENTIONAL  RAM  ADDRESS  DECODERS 

The  heap  memory  organizes  all  the  memory  cells  into  a  linear  shift  register  with  data  paths  connecting 
neighboring  cells.  In  addition,  some  of  the  memory  instructions  require  a  tree  of  combinational  logic 
connected  to  the  memory  cells.  Even  with  all  this  hardware,  the  APSA  heap  memory  is  similar  in  speed  to 
conventional  RAM  memories,  because  RAM  address  decoding  logic  has  the  same  time  complexity:  log  N 
for  N  words. 

This  analysis  explains  the  source  of  APSA’s  parallelism.  Conventional  memories  get  very  little  power 
from  their  address  decoders:  they  .iu.-.t  implement  fetch  and  store.  APSA  uses  its  combinational  logic  tree 
to  implement  a  variety  of  parallel  data  structure  primitives. 


EMULATION  OF  APSA  USING  FINE  GRAIN  PARALLEL  MACHINES 


In  general,  it  is  not  clear  how  to  implement  LISP  on  parallel  computers.  However,  the  APSA  research 
provides  an  approach  for  doing  this.  Instead  of  implementing  LISP  directly  on  the  parallel  computer,  we 
can  emulate  the  APSA  heap  memory,  which  is  usually  more  straightforward.  The  processing  elements  of 
the  real  parallel  computer  can  then  execute  the  applicative  data  structure  operations  in  parallel. 

Following  this  method,  APSA  is  currently  being  emulated  using  the  Goodyear  Massively  Parallel 
Processor  (MPP)  [3,  6,  7].  The  Connection  Machine  (CM)  [l]  would  also  be  a  good  host  for  emulating 
APSA.  This  raises  a  number  of  interesting  issues  relating  to  the  MPP  and  the  CM.  For  example,  the 
MPP  processing  elements  can  communicate  only  with  their  nearest  neighbors,  while  the  CM  has  long  data 
paths.  However,  the  MPP  has  an  advantage  that  may  compensate  for  its  lack  of  connections  between 
distant  processors:  the  MPP  communication  algorithms  are  deterministic,  synchronous  and  very  fast. 
Although  the  CM  can  send  messages  directly  over  long  distances,  it  cannot  guarantee  their  arrival  at  a 
particular  time.  Thus  the  APSA  emulation  requires  less  overhead  due  to  synchronization  on  the  MPP 
than  on  the  CM.  We  do  not  know  yet  what  the  relative  performance  of  the  MPP  and  the  CM  would  be. 

DISCUSSION 

There  is  a  tradeoff  between  how  specialized  a  machine  is  and  how  successful  it  is  on  a  wide  range  of 
applications.  There  are  two  ways  of  looking  at  APSA: 

•  It  is  a  special  purpose  machine,  and  its  intended  application  is  the  implementation  of  applicative 
language  interpreters. 

•  It  is  a  general  purpose  programming  language  host  which  can  run  algorithms  for  any  application. 

Thus  APSA  makes  any  applicative  program  run  more  efficiently.  APSA’s  chief  advantage  is  that  it  is 
easy  for  the  programmer  to  obtain  the  benefits  of  its  parallelism,  because  the  programmer  doesn’t  need 
to  worry  about  breaking  the  problem  into  independent  parts,  or  scheduling  a  number  of  processors,  or 
reducing  memory  contention.  However,  there  is  a  penalty  for  this  ease  of  use:  the  fastest  possible  parallel 
algorithm  for  a  particular  problem  may  not  be  able  to  run  on  APSA.  An  important  area  for  further  research 
is  investigating  how  to  fit  more  specialized  parallel  algorithms  into  the  APSA  framework. 

APSA  achieves  two  levels  of  parallelism  in  its  garbage  collection  algorithm:  (1)  simultaneous  instruc¬ 
tions  issued  by  two  processors  and  (2)  parallel  operations  within  the  heap  memory.  Many  other  problems 
also  exhibit  several  levels  of  inherent  parallelism,  and  it  will  be  fruitful  to  try  to  exploit  as  many  levels  as 
possible  in  one  system.  If  it  is  possible  to  combine  APSA’s  heap  memory  operations  with  multiprocessor 
parallelism,  the  resulting  architecture  might  provide  several  levels  of  parallelism  for  a  large  set  of  problems. 
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1.  Summary 

This  is  a  brief  outline  of  the  author's  recommendation  about  one  future  direction  which  computer 
architecture  ought  to  take.  The  focus  is  on  lowering  the  preponderant  evolvability  and  interoperability 
component  of  life  cycle  costs  for  real-time  cVbattle  management  computer  systems,  through  greatly 
improved  architectural  (particularly  processor)  modularity;  an  additional  cost/performance  contribu¬ 
tion  by  such  modularity  is  the  enhanced  feasibility  of  architectural  algorithmic  specialization  (an 
important  capability  despite  the  misunderstandings  of  RISC  proponents). 

2.  Recommendation 

This  recommendation  is  directed  towards  embedded  computers  for  real-time  supervisory  control, 
particularly  in  DoD  combat  platform  and  battle  management  applications  —  it  is  applicable  in  some 
degree  to  many  other  contexts. 

In  real-time  control  systems,  the  computer  requirements  are  ill-defined  initially  and  continue  to 
change  across  not  just  the  design  phase  but  even  across  the  entire  lifetime  of  the  system;  that  lifetime 
is  often  decades  in  DoD  and  certain  industry  (e.g.,  electric  utility)  applications.  The  principle  reasons 
for  this  are:  that  the  application  is  complex  and  is  not  (perhaps  cannot  be)  well  understood;  the 
application  environment  (e.g.,  the  threat)  varies  according  to  geopolitics,  doctrine,  etc.,  requiring  that 
the  system  functionality  be  altered,  the  performance  improved,  and  the  size  increased;  technology 
evolves,  making  some  desireable  system  attributes  possible  and  others  more  cost-effective,  but  also 
making  older  technology  first  exhorbitantly  expensive  and  then  unavailable  for  further  production  or 
even  maintainance,  so  internal  and  transparent  substitution  is  called  for;  the  computer  is  considered 
the  ultimate  recourse  for  accomodating  system  modifications,  because  "it's  a  simple  matter  of 
programming." 


We  will  use  the  term  modularity  for  the  characteristic  a  computer  system  needs  in  order  to  be 
successful  in  these  exceptionally  demanding  circumstances.  Common  synonyms  include 
"interoperability”,  "evolvability",  and  "extensibility,"  all  of  which  we  prefer  to  consider  system  at¬ 
tributes  which  depend  on  modularity. 

Both  software  (e.g.,  the  use  of  standard  high  level  languages,  and  increasingly,  operating  systems) 
and  hardware  play  a  part  in  computer  system  modularity.  We  are  concerned  here  with  hardware,  and 
specifically  with  approaches  for  individual  processors;  these  can  then  be  utilized  in  various  Kinds  of 
multiple-processor  computers  to  achieve  additional  modularity. 

The  costs  of  insufficient  computer  system  modularity  are  consistently  estimated  to  be  an  order  of 
magnitude  larger  than  the  initial  design  and  procurement  costs.  But  a  high  degree  of  modularity  is 
almost  never  even  one  of  the  driving  forces  in  processor  architecture,  which  suggests  that  computer 
system  designers  generally  exhibit  a  blind  spot  and  misplaced  priorities  (e.g.,  the  single-minded  pas¬ 
sion  for  throughput  in  millions  of  instructions  per  second),  at  least  so  far  as  the  real-time  control 
environment  is  concerned.  Consequently,  the  technology  for  achieving  modularity  is  very  primitive, 
and  much  more  research  and  experience  are  necessary. 

One  approach  has  been  the  family  of  upward-compatible  computers:  but  even  when  there  is  effec¬ 
tive  compatibility  among  family  members,  the  cost/performance  and  functionality  alternatives  are 
limited  and  have  been  chosen  by  the  computer  manufacturer  (with  his  own  constraints  and  objectives 
in  mind),  not  by  the  users  for  their  requirements. 

More  appropriate  would  be  an  expandable  range  of  processor  instruction  set  architectures,  starting 
with  a  basic  core  of  instructions  similar  to  the  RISC  notion,  except  that  neither  single-cycle  execution 
nor  frequency  of  use  are  inherently  the  proper  (much  less  sole)  criteria  for.  membership  in  this  core. 
On  such  a  foundation,  further  functionality  is  added  as  justified  by  cost-effectiveness  tradeoffs  to 
meet  current  system  requirements  — ■  for  example:  performance  (e.g.,  algorithmic  specialization,  not 
only  for  an  application  but  also  for  the  system,  such  as  object  invocation,  interprocess  communica¬ 
tion,  real-time  scheduling);  fault  tolerance  (e.g.,  support  for  atomic  transactions,  replication);  software 
compatibility. 

This  suggests  the  use  of  microcode  as  a  processor  modularity  technique,  although  historically  it  has 
been  devoted  primarily  to  instruction  set  interpretation,  with  fewer  instances  of  vertical  migration  from 
the  operating  system  level  and  fewer  still  from  the  application  level.  While  the  advantages  of 
microcode  have  not  yet  been  fully  exploited  (for  various  technical  and  non  technical  reasons),  as 


currently  envisioned  it  is  inherently  limited.  The  microinstruction  set  is  bound  to  a  particular  collec¬ 
tion  of  data  path  resources  —  the  functional  (e.g.,  arithmetic)  elements,  storage  units  (e.g.,  registers), 
and  transfer  paths  —  and  often  there  are  interpretation  segments  (e.g.,  macrolnstruction  fetch)  which 
are  hardwired.  This  permits  only  limited  flexibility  in  creating  instruction  set  architectures  and  func¬ 
tionally  specialized  microengines  tailored  to  given  requirements. 

Maleability  of  the  instruction  set  (exo-)architecture  is  only  part  of  the  solution,  however;  a  high 
degree  of  modularity  implies  that  its  implementation  (endo-architecture)  must  also  be  alterable  — 
e.g.,  acceleration  mechanisms  such  as  pipeline  stages  and  caches  can  be  included  when  and  where 
they  are  needed.  At  the  present  time,  the  instruction  set  architecture  and  its  implementation  are 
always  bundled  together  in  any  given  processor,  whether  or  not  it  is  microprogrammed. 

Another  form  of  modularity  is  found  in  the  idea  of  co-processors,  such  as  the  popular  floating  point 
units  for  microprocessors.  The  interfaces  to  current  incarnations  of  co-processors  are  improving,  but 
they  are  stilt  excessively  constrained.  Control  is  usually  master/slave  and  highly  stylized  (e.g.,  no 
form  of  nesting  or  other  composition),  execution  is  synchronized  to  the  instruction  sequencing  of  the 
main  processor,  and  there  are  too  few  shared  resources  between  the  processors.  An  ’extended 
function  unit’,  which  is  ordinarily  a  simple  synchronous  adjunct  to  the  processor’s  main  arithmetic 
unit,  can  also  be  used  for  incorporating  certain  extensions. 

Thus,  the  recommendation  is  that  effort  be  directed  toward  seeking  new  exo-  and  endo- 
architectural  concepts  and  techniques  which  overcome  the  modularity  obstacles  of  con¬ 
ventional  processors.  It  should  be  possible  to  have  a  physical  and  logical  framework  that  remains 
constant,  to  which  various  special-purpose  and  general-purpose  functional  (data  handling  and 
control)  hardware  and  microcode  elements  are  added  at  system  configuration  time  as 
cost/performance  tradeoffs  call  for.  This  can  be  done  in  a  manner  which  combines  the  best  of  both 
the  RISC  and  CISC  philosophies  —  complex  functions  implemented  in  hardware  or  microcode  are 
faster  than  equivalent  software,  but  do  not  slow  down  the  execution  of  simple  functions  which  require 
only  a  single  cycle.  It  should  be  possible  for  functions  to  be  migrated  among  software,  microcode, 
and  hardware  levels  transparantly  to  the  programmers,  perhaps  by  elmininating  the  classical 
dichotomy  between  software  invocation  mechanisms  (i.e.,  procedure  calls)  and  hardware/microcode 
invocaction  mechanisms  (i.e.,  op  codes),  making  the  invocation  of  every  functic  i  identical  regardless 
of  its  implementation. 
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FUTURE  DIRECTIONS  IN  COMPUTER  ARCHITECTURE 
AND  SOFTWARE  WORKSHOP 


Comments  by  J.  R.  Burke,  Chairman  Session  5 
Interconnection  Strategies 


Interconnection  networks  are  required  to  provide  communication  links  In  a 
multiprocessor  computer  system.  Whether  such  networks  are  static  or  dynamic, 
their  throughput  can  become  a  bottleneck  when  the  number  of  processors  Is 
large.  The  paper  "Image  Texture  Classification  with  an  Optical  Crossbar 
Interconnected  Processor,"  in  the  Interconnection  Strategies  session  describes 
a  development  which  may  overcome  this  bottleneck.  The  crossbar  is  effected 
optically  and  dynamically  using,  for  example,  lenses  or  acousto-optic  gratings 
to  determine  the  optical  path  between  an  array  of  laser  diodes  (representing 
sources)  and  an  array  of  photodiodes  (representing  destinations).  Messages 
would  be  passed  by  modulation  of  the  laser  diodes. 

The  following  recommendations  were  abstracted  from  the  summary  session  where 
the  workshop  presentations  were  discussed: 

1.  The  trade-off  between  design  of  custom  chips  vs.  off-the-shelf  ships 
will  require  modified  algorithms. 

2.  Differential  between  processor  speed  and  memory  access  time  is  increasing. 
On  chip  memory  and  overlapped  pipelining  of  instruction  and  operands  is 
helping.  However,  there  are  no  revolutionary  Ideas  as  to  how  to  fix  this 
limiting  bottleneck. 

3.  TI  presented  a  paper  on  a  fiber  optic  crossbar  switch  for  a  mul ti processor 
i nterconnection  network.  Because  of  large  bandwidth  of  fiber  optics, 

the  bit  rate  can  be  much  greater  than  on  wire.  Thus  serial  rather  than 
parallel  word  transmission  can  be  used.  This  significantly  reduces  the 
number  of  wires  that  need  to  be  used.  There  was  unanimous  agreement  that 
this  technology  should  receive  further  support.  As  the  DARPA  contract 
with  TI  is  about  to  end,  I  have  suggested  to  John  Zavada  that  he  might 
want  to  consider  a  follow-on. 

4.  It  was  argued  by  some  of  the  attendees  that  there  are  important 
applications  where  a  multiprocessor  system  containing  10,000  processors 
would  be  useful  ( 1 . e . ,  an  imaging  system  using  one  processor  per  bit). 

In  contrast  to  general  belief,  It  was  also  argued  that  programs  to  exploit 
such  a  large  degree  of  parallelism  would  be  written  if  the  machines  were 
available. 

5.  As  a  corollary  to  H,  there  Is  a  need  to  provide  multiprocessor  machines 
to  the  community  in  order  to  accelerate  experience  in  writing  and  testing 
programs  to  exploit  parallelism  for  speed-up. 

6.  Automatic  dynamic  reconfiguration  of  processor  interconnections  to  fit 
changing  precedence  graphs. 


7.  An  evaluation  of  what  are  the  desired  performance  measures  for  machines. 


8.  More  appropriate  benchmark  algorithms/programs. 


9.  Theoretical  measures  for  determining  architecture  effectiveness 
Independent  of  software. 

10.  Tools  to  retarget  software  to  new  machines  needed  to  avoid  Inefficiency 
of  older  languages  such  as  Fortran  on  such  machines.  Billions  of 
dollars  has  been  Invested  In  software  development  which  Is  now  OBE. 

11.  While  distributed  operating  systems  provide  fault  tolerance,  efficiency 
and  survivability,  cooperation  and  correlation  are  problems.  Techniques 
for  global  resource  management  need  to  be  developed. 

12.  Tradeoffs  between  the  uses  of  synchronization  versus  asynchronization 
need  to  be  examined.  Partial  synchronization  may  be  more  important 
for  some  problems. 

13.  Need  theory  for  non-seri all  zed  transactions. 

14.  Heterogeneous  networks  of  machines  are  needed  for  some  applications. 

Work  on  how  to  develop  data  bases  for  such  networks  is  needed. 

15.  More  work  on  Logic  and  Functional  Programming. 
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Doyce  Satterfield 
SUMMARY 

ADVANCED  DISTRIBUTED  ON-BOARD  PROCESSING 

The  U.S.  Army  Strategic  Defense  Command  has  been  over  the  past  few  years 
investigating  advanced  signal  and  data  processing  technology  for  obtaining 
target  information  with  complex  infrared  (IR)  sensors.  The  results  of 
these  investigations  have  culminated  in  the  development  of  a  very  sophisti¬ 
cated  hardware-in-the-loop  signal  and  data  processing  testbed.  This  testbed 
facility  consists  of  a  Mosaic  Sensor  Emulation  Unit  (MSEU),  an  Auxilliary 
Sensor  Signal  Processor  Unit  (ASSPU),  and  an  Advanced  Distributed  Onboard 
Processor  (ADOP). 

The  ADOP  system  has  a  15  MIP  capability  and  comprises  five  processing 
elements  or  nodes  which  communicate  on  three,  1  M  word/sec  independent 
global  buses.  Each  node  contains  three  Central  Processing  Units  (CPUs), 
three  Floating  Point  Processors  (FPPs),  one  million  (1M)  words  of  17-bit 
memory,  and  a  Global  Bus  Interface  Controller  (BIC). 

The  ADOP  node  architecture  exercises  four  basic  functions: 

(1)  A  fixed  point,  MIL-ST0-1750A  instruction  set  architecture,  CPU 
capable  of  executing  the  fixed-point  Digital  Avionics  Instrumentation 
System  (DAIS)  instruction  mix  at  1  MIPS.  The  CPU  implements  a  pipelined 
instruction  queue  and  embedded  memory  management  unit  function  to  achieve 
this  throughout,  while  addressing  up  to  1M  words  of  memory.  The  basic  CPU 
microcycle  time  in  200  nsec  while  the  memory  system  access  time  is  350  nsec. 

(2)  A  FPP  serves  as  an  adjunct  to  the  CPU  to  facilitate  achievement  of 
1  MIPS  full  DAIS  floating  point  operation.  Using  the  built-in  function 
generator  option  of  MIL-STD-1750A,  special  macro  instructions  can  be  coded 
to  do  special  operations  for  signal  processing  operations.  The  FPP  contains 
a  hardware  multiplier  and  an  ALU  which  essentially  "shadows"  the  CPU 
arithmetic  Logic  Unit. 

(3)  A  four  port  local  memory  organized  as  25 6K  words  by  17  bits  (1  parity 
bit).  A  read-modi fy-write  function  facilitates  semaphore  operations  required 
by  the  operating  system.  The  system  of  node-local  buses  provides  access  to 

all  of  memory  by  each  CPU. 

(4)  A  BIC,  which  coupled  with  the  real-time  operating  system, 
efficiently  handles  all  global  communications.  The  BIC  accesses  message 
queues  in  node-local  memory  and  moves  messages  on  the  system  of  global  buses. 

A  logical  addressing  scheme  allows  the  routing  to  be  configured  by  the 
operating  system  without  affecting  application  software. 

The  ADOP  node  structure  allows  virtually  contention  free  processing  of  an 
application  program  when  data  sets/algorithms  are  mapped  non-overlapping.  To 
this  end,  a  complete  set  of  software  tools  has  been  developed,  including  a  PASCAL 
compiler  and  a  distributed  operating  system.  The  use  of  these  tools  provides 
a  strict  adherence  to  programming  standards  and  configuration  controls  that  make 
experimentation  on  the  Multi-CPU,  Multi-Node  (distributed)  architecture  an 
achievable  task. 


SUMMARY 


ADVANCED  DISTRIBUTED  ON-BOARD  PROCESSING 
(Conti nued) 


The  distribution  of  data  bases,  the  mapping  onto  the  architecture 
and  the  distributed  operating  system  for  this  research  and  development 
program  will  be  described  at  the  5-7  May  1986  workshop. 
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ARO  workshop  conclusions 
Jon  Mauney 


I  will  restrict  my  comments  to  two  software  areas  discussed  at  the 
workshop,  leaving  the  other  areas  to  those  with  more  expertise. 

Reusable  Software 

At  the  wrap-up  session,  two  distinct  topics  were  grouped  under  the 
heading  of  "Reusable  Software."  The  first  problem  is  the  continued  use 
of  existing  applications  on  new  systems.  This  is  an  important  problem 
because  of  the  vast  investment  represented  by  the  body  of  existing  code. 
To  protect  this  investment,  the  code  must  be  adapted  to  new  versions  of 
a  system,  and  to  new  operating  systems,  which  may  support  different 
dialects  of  the  programming  language.  Such  transport,  although  impor¬ 
tant,  does  not  seem  to  be  a  major  research  issue.  The  presentation  by 
Takefuji  and  Dowell  at  the  workshop  described  a  fairly  simple  expert 
system  for  dialects  of  LISP;  the  Lexeme  corporation  is  advertising  a 
commercial  product  that  translates  among  high-level  languages  such  as 
Fortran,  Pascal,  and  Ada. 

A  harder  problem  is  efficient  execution  of  existing  code  on  very 
different  hardware,  such  as  vector  and  parallel  machines.  Again,  this 
is  important  if  the  benefits  of  new  architectures  are  to  be  extended  to 
existing  programs.  This  problem  is  addressed  by  much  of  the  research 
into  parallel  compilers,  especially  the  so-called  "dusty-deck"  com¬ 
pilers.  There  is  still  a  lot  to  be  done  in  this  area.  However,  the 
general  feeling  of  the  languages-and-compilers  researchers  at  the 
workshop  was  that  the  future  lies  with  newer  languages,  more  closely 
related  to  logic  and  functional  programming  than  to  Fortran. 

The  other  Reusable  Software  problem  is  one  of  productivity  in  the 
writing  of  new  applications.  Programmers  can  be  more  productive  if  they 
can  reuse  parts  of  previous  projects,  or  access  a  library  of  program 
building  blocks,  rather  than  having  to  start  from  scratch.  This  problem 
is  the  one  addressed  by  the  presentations  in  the  workshop,  and  it  is  the 
kind  of  thing  that  the  STARS  program  was  set  up  to  attack.  This  is 
clearly  an  important  problem  for  the  future. 

Functional  and  Logic  Programming 

At  the  session  on  Logic  and  Functional  Programming  there  was,  of 
course,  agreement  that  this  style  of  programming  is  the  direction  to  go 
in  the  future.  Most  people  working  on  compilers  and  related  software 
for  parallel  architectures  have  decided  that  functional  languages  have 
overriding  advantages.  It  may  be  worthwhile  studying  ways  to  convince 
the  everyday  programmer  that  this  is  so. 

The  current  research  projects  presented  at  this  session  were 
divided  on  the  question  of  automatically  detected  parallelism  versus 
explicitly  programmed  parallelism.  One  presentor  began  by  stating  that 
he  rejected  explicit  parallelism  and  described  how  he  avoided  it, 
another  described  a  parallel  architecture  for  LISP  using  explicitly 
coded  parallelism.  This  disagreement  applies  to  conventional  imperative 
languages  as  well  as  logic  and  functional  languages.  In  the  discussion 


following,  Che  consensus  was  Chat  automatic  detection  of  parallelism  is 
preferable,  especially  if  very  large  numbers  of  processors  are  avail¬ 
able,  but  that  there  will  always  be  applications  that  push  the  limits  of 
performance  and  require  close  programmer  control  of  parallelism.  It 
would  seem,  therefore,  that  it  will  continue  to  be  important  to  do 
research  in  automatic  detection  and  exploitation  of  parallelism  (both  in 
conventional  and  in  functional  languages),  and  to  determine  what  the 
best  trade-off  point  is  between  automatic  and  explicit  parallelism. 


Bell  Laboratories 


Crawfords  Corner  Road 
Holmdel.  New  Jersey  07733 
Phone  (201)  949-3000 


May  16,  1986 


Professor  Dharma  P.  Agrawal 

Professor  &  Workshop  Chairman 

North  Carolina  State  University 

Department  of  Electrical  &  Computer  Engineering 

Box  7911 

Raleigh,  NC  27695-7911 


Dear  Professor  Agrawal: 

I  enjoyed  attending  the  workshop  on  future  directions  in  architecture  and  software. 
Although  I  spent  only  one  day  at  the  conference  I  found  the  papers  very  interesting.  More 
importantly,  the  attendees  represented  the  right  mix  of  academia  and  industry  and  I  found 
the  discussions  extremely  fruitful. 

The  only  area  I  would  like  to  see  covered  in  addition  to  the  ones  already  discussed  at  the 
conference  is  the  availability  of  new  technologies.  Development  of  new  technologies  such  as 
wafer  scale  integration,  optical  computing,  etc.,  can  (and  will),  have  a  major  impact  on 
future  computer  systems. 

Thank  you  very  much  for  the  opportunity  to  participate  in  the  workshop.  I  look  forward  to 
interacting  with  you  again  in  the  future. 

Yours  very  tful)^  / / 

Sudhir  R.  Ahuja 
Head 

System  Architectures 
Research  Department 


DEPARTMENT  OF  THE  NAVY 

NAVAL  TRAINING  5ySTEMS  CENTER 


2  Jun  86 


Dr.  Dharma  P.  Agrawal 

Department  of  Electrical  and  Computer  Engineering 
North  Carolina  State  University  School  of  Engineering 
Box  7911 

Raleigh,  NC  27695-7911 
Dear  Dr.  Agrawal: 

I  must  apologize  for  not  answering  your  letter  sooner  relating  to  the  recent 
ARO-sponsored  workshop.  Official  travel  seems  to  have  occupied  most  of  my  time 
since  returning  from  the  workshop. 

Speaking  very  candidly,  I  did  receive  benefits  from  participating;  such  as 
informal  professional  opinions  and  technical  discussions  in  addition  to  the 
contacts  made.  However,  as  was  discussed,  the  thrust  of  some  of  the 
presentations  covered  previous  work  rather  than  postulate  future  directions  of 
the  technology.  Concerning  my  session  (Session  4),  I  did  feel  that  a  portion 
of  the  presentations  were  within  the  workshop  theme.  The  discussion  by  Dr. 

A.  J.  Smith  of  UC-Berkley  concerning  the  future  of  memory  hierarchies  seemed 
to  be  of  particular  interest  to  the  participants.  The  presentation  of  Dr.  P. 

A.  Subrahmanyane  of  AT&T  Labs  concerning  parallel  architectures  for  Integrated 
system  design  tools  was  unique  to  a  technology  area  (CAD/CAM)  that  is  becoming 
increasingly  important  to  the  total  computer  technology  arena. 

For  my  own  input  to  you,  I  am  directly  involved  in  exploratory  development 
(6.2)  and  advanced  development  (6.3)  projects  related  to  specLal  parallel 
processing  architectures  for  unique  very  high  speed  processing.  These  also 
relate  to  future  real-time  artificial  intelligence  (expert  systems)  adjuncts 
to  training  systems.  The  potential  of  dataflow  architectures  to  provide 
solutions  to  our  long  range  requirements  is  also  being  addressed.  It  goes 
without  saying  that  with  the  advent  of  the  new  standard  DoD  language,  Ada, 
software  developments  should  assume  a  new  and  different  perspective  for  the 
future.  Much  work  needs  to  be  accomplished  in  this  area. 

My  overall  evaluation  of  the  workshop  was  positive  and  I  am  of  the  opinion  it 
served  a  need  that  should  be  exploited  in  the  future.  I  would  hope  ARO  would 
see  the  potential  for  and  value  of  such  workshops  and  assume  the  initiative  in 
planning  and  executing  them  periodically.  I  would  appreciate  any  opportunity 
to  attend  and  participate  in  any  such  workshops. 

I  felt  that  you  and  your  staff  did  a  very  fine  job  in  planning  and  running  the 
workshop.  The  location  chosen  was  certainly  unique  and  the  facilities  adequate. 

I  will  appreciate  maintaining  a  contact  with  you  in  the  future  for  the  beneficial 
exchange  of  information  related  to  the  total  area  of  computer  technology. 

Sincerely,  . 


C.  FORREST  SUMMER,  P.E. 

Head,  Advanced  Computer  Technology  Branch 
R^^arch  ami  Development  Department 


August  4,  1986 

TO:  Dharma  P.  Agrawal 

Department  of  Electrical  and  Computer  Engineering 

232  Daniels  Hall 

North  Carolina  State  University 

Raleigh,  NC  27695 

FROM:  R.  R.  Oldehoeft  L-419 

Lawrence  Livermore  National  Laboratory 
P.  0.  Box  808 
Livermore,  CA  94550 

SUBJECT:  Future  directions  in  architecture  and  software 

Thank  you  for  your  recent  telephone  call.  I  had  intended  a  prompt 
response  to  your  original  request,  but  the  end-of-semester  rush  com¬ 
bined  with  my  relocation  to  California  for  my  sabbatical  leave  prevented 
me  from  getting  to  it.  I  assumed  my  input  was  no  longer  timely,  but  I 
appreciate  this  opportunity  nevertheless.  Since  I  am  not  an  architec¬ 
tural  specialist,  I  can  only  comment  from  my  point  of  view  as  an 
academic  computer  scientist  and  a  software  designer. 

1.  An  important  requirement  for  progress  in  parallel  processing 
research  is  the  provision  of  parallel  processing  testbeds  for 
academic  researchers.  Since  many  will  not  have  the  capability  or 
propensity  for  construction  and  maintenance  of  custom  systems, 
this  means  that  commercial  systems  are  necessary.  Fortunately, 
some  are  currently  available,  and  they  represent  a  variety  of 
approaches  to  parallel  processing.  They  include  BBN’s  Butterfly, 
the  Sequent  Balance  21000,  the  Loral  Dataflow  system,  Intel’s 
Hypercube,  the  Ncube  boards  for  use  with  an  IBM  AT  microproces¬ 
sor,  vector  processor  add-ons,  etc.  More  will  surely  appear. 

Probably  the  best  way  to  make  these  systems  widely  available 
(except  for  the  cheapest)  is  to  establish  network  access  to  “many" 
centers  whose  purposes  are  several.  First,  the  centers  must  make 
availability  and  uptime  of  the  systems  and  the  network  their  prime 
concerns.  Second,  they  need  to  help  their  clients  in  learning  to  use 
the  systems.  Third,  they  must  individually  and  collectively  dissem¬ 
inate  useful  information,  tools  and  techniques  to  the  entire  com¬ 
munity  and  the  larger  audience  as  well.  Funding  needs  to  be  avail¬ 
able  for  the  establishment  and  continuing  support  of  these  centers, 
and  the  country’s  network  capability  needs  to  be  expanded  to 
reach  more  researchers. 

There  should  be  an  educational  aspect  to  these  centers  as  well  We 
need  many  more  people  with  an  appreciation  of  parallel  processing 


as  a  general  technique  for  effective  programming.  To  the  extent 
possible,  these  centers  should  support  some  instructional  access  to 
the  supported  machines.  The  extent  depends  on  the  amount  of 
total  access  that  can  be  provided. 

2.  On  the  software  side,  I'd  like  to  discuss  current  issues  with  which  I 
have  some  familiarity.  The  importance  of  applicative  (functional) 
programming  languages  is  beginning  to  be  understood  with  the 
early  results  of  the  SISAL  project,  among  others. 

a)  Additional  language  development  is  necessary  to  bring  projects 
“up  to  speed”  with  respect  to  features  that  have  been 
developed  largely  outside  the  domain  of  multiprocessing. 
Vector/array  processing  capabilities  and  modularity  con¬ 
structs  for  developing  large  systems  are  two  examples. 

b)  The  requirement  that  extant  codes  remain  useful  in  the  paral¬ 
lel  functional  world  presents  some  interesting  research  oppor¬ 
tunities.  In  particular,  libraries  of  useful  routines  need  to  be 
accessible  from  applicatively  written  programs  without 
sacrificing  two  advantages  of  writing  in  an  applicative 
language.  Parallelism  must  be  reasonably  well  preserved,  and 
the  functional  nature  of  computing  must  be  protected  from 
arbitrary  interference  by  these  imperatively  written  routines. 

3.  Finally,  we  need  some  studies  in  the  granularity  of  parallelism 
different  from  the  current  approach.  We  use  the  granularity  pro¬ 
vided  by  a  parallel  processing  system  to  obtain  one  point  of  data 
describing  what  works  on  a  particular  machine.  Instead,  we  should 
find  ways  to  easily  alter  the  units  of  parallel  work  as  well  as  the 
detailed  hardware  structures  that  do  the  work.  This  means  that 
simulation  studies  will  continue  to  be  important  for  this  research. 

Once  again,  thank  you  for  the  opportunity  to  respond  in  such  a 
tardy  fashion.  Please  let  me  know  if  I  can  be  of  further  help. 


'I 


Comments  on  the 

ARO  Workshop  on  Future  Directions  in 
Computer  Architecture  and  Software,  May  1988 

L.  N.  Bhuyan 


The  Workshop  organized  by  the  Army  Research  Office  was  a  great  success. 
Although  the  scope  of  discussion  was  a  little  broad,  there  was  a  lot  of  personal  rap¬ 
port  between  the  researchers  working  in  definite  areas.  Particularly,  the  short 
presentations  and  discussions  on  the  last  day  of  the  workshop  were  lively. 

In  general  there  was  a  feeling  that  the  software  was  lagging  behind  the  archi¬ 
tecture.  It  is  quite  natural  because  not  many  parallel  systems  have  been  built  for 
the  researchers  to  do  experiments  with.  It  is  my  opinion  that  more  effort  needs  to 
be  spent  on  the  design  of  operating  systems,  algorithms,  and  software  both  at  basic 
research  and  experimental  levels.  At  the  same  time,  the  ARO  should  continue  sup¬ 
porting  innovative  ideas  in  architecture  and  encourage  design  and  implementation 
of  such  architectures  at  the  universities  and  industries. 
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COMMENTS  ON  COMPUTER  A&S  WORKSHOP:  John  Zavada 

While  a  number  of  interesting  talks  were  given  at  the  workshop,  onl7  a  few 
people  seemed  willing  to  express  their  ideas  for  new  directions  of  research. 

Most  of  the  talks  were  centered  about  an  individual  investigator's  recent 
research  activities.  While  the  people  at  the  workshop  represented  a  cross 
section  of  the  foremost  researchers  in  this  area,  I  think  that  the  body  of 
presented  talks  is  an  indication  of  the  direction  of  current  and  near-term 
research  and  not  directly  a  source  of  ideas  for  future  directions.  Of  course, 
there  were  exceptions  and  I  will  comment  on  several  that  T  noticed. 

Col.  Ward  gave  a  very  interesting  talk  concerning  some  of  the  DoD  needs  in  the 
area  of  computer  A&S.  One  of  the  items  that  impressed  me  dealt  with  the  huge 
volume  of  software  in  DoD  use  and  the  effort  to  convert  the  data  into  a  single 
language  (ADA).  While  this  doesn't  sound  very  glamorous,  I  view  this  as  a  major 
problem  confronting  DoD  operations.  If  existing  codes  for  aircraft,  missile 
defence,  etc.  are  to  be  recoded  accurately,  either  a  great  many  computer  people 
will  have  to  be  hire  by  DoD  or  a  very  long  time  will  be  required.  This  topic 
did  not  seem  attract  much  attention  in  the  body  of  talks  at  the  workshop.  An 
area  in  which  thi3  need  would  seem  to  be  critical  is  that  of  nuclear  weapons. 

The  software  that  is  utlized  for  nuclear  missile  firings,  guidance,  and  control 
must  simply  be  of  the  highest  quality  and  free  of  errors. 

Another  topic  that  Col.  Ward  mentioned  was  the  need  for  machines  capable  of 
billions  of  operations  per  second.  This  topic  is  often  mentioned  in  workshop 
meetings  either  on  electronics  or  computers.  I  feel  that  there  are  a  number  of 
approaches  to  this  problem  but  that  nobody  is  quite  sure  on  how  to  achieve  this 
goal  other  than  by  brute  force:  smaller  electronic  devices,  higher  chip  device 
density,  networking  of  computers.  It  is  much  harder  to  develop  new  ways  to 
process  the  data  and  to  realize  efficient  parallel  processing.  Very  few  talks 
seemed  to  address  new  techniques  for  achieving  such  processing. 

D.  Jensen  of  CMU  tried  in  his  talk  to  simulate  ideas  concerning  future  research 
directions.  I  think  that  his  comments  questioning  present  computer  techniques 
are  valid.  However,  there  wasn't  much  time  for  discussion  and  his  talks  often 
have  a  humorous  aspect  that  may  distract  the  audience  from  his  central  message. 
Still  there  should  be  more  discussions  on  the  present  status  of  computer 
science  and  on  the  historical  development  of  the  area. 

A.  McAulav  of  TI  gave  an  interesting  talk  on  the  use  of  optics  for  parallel 
processing  and  interconnects.  While  much  of  his  talk  was  devoted  to  describing 
the  hardware  for  achieving  an  optical  crossbar  switch,  the  actual  use  of  such 
a  device  for  computing  was  only  briefly  treated.  I  think  that  the  use  of  optics 
in  computer  A&S  i3  a  largely  undeveloped  field.  Clearly,  optics  is  going  to 
play  a  greater  role  in  electronic  devices  and  chips,  but  how  is  the  architecture 
and  software  going  to  accommodate  the  added  capabilities  optics  provide.  Along 
with  this  use  of  optics  is  the  question  of  reconf igurable  architecture.  Optics 
holds  the  promise  of  doing  this  if  a  number  of  material  and  manufacturing 
problems  can  be  overcome.  But  only  a  small  amount  of  work  seems  to  be  devoted 
to  exploring  how  this  reconfiguration  can  be  utilized.  G.  Gray  of  '/PI  is  doing 
work  for  fault  tolerant  operations  but  there  must  be  other  areas  that  could 
benefit  from  reconfigurability. 

J.  Staudhammer  of  Univ.  of  FL  presented  interesting  results  concerning  3-D 
color  graphics.  His  talk  centered  on  image  generation  and  imaae  transition. 
Symbolics  is  an  important  area  of  computer  applications  especially  in  the 
DoD  arena.  Manv  situations  call  for  the  maniculat ~or.  of  symbols  and  more 
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effort  should  be  devoted  to  performing  these  operations  efficiently.  The 
implications  on  computer  A &S  should  be  studied. 

In  general,  it  is  very  difficult  to  predict  the  future,  even  for  experts  in 
a  given  area.  Also,  in  an  open  forum  many  such  experts  may  be  reluctant  to 
express  their  new  ideas  or  feelings.  Novel  ideas  are  probably  best  reserved 
for  funding  agencies  and  Ph.D.  topics.  Consequently,  it  is  easier  to  present 
results  of  studies  that  have  been  completed  and  submitted  for  publication. 

An  example  of  this  can  be  found  in  the  talk  by  S.  Lundstrom  of  MCC. 

While  his  talk  was  informative,  it  was  a  summary  of  work  that  has  already  been 
completed  for  the  airline  industry.  Directions  that  MCC  is  following  were  not 
mentioned.  Another  aspects  is  that  only  a  few  people  have  the  experience  and 
knowledge  to  accurately  judge  trends  and  developments.  Usually,  these  are  the 
older  people  with  years  of  work  in  the  area.  People  like  Jensen  are  a  valuable 
source  of  information  but  there  has  to  be  suitable  forum  for  them  to  expound 
their  views  and  listen  to  opposing  comments. 

Perhaps  there  should  be  a  follow-up  workshop  involving  a  smaller  number  of 
people  drawn  from  the  ones  that  were  at  Seabrook.  Their  objective  would  be  to 
discuss  the  results  of  Seabrook  I  and  to  develop  detailed  scenarios  as  to  the 
future  directions  of  computer  A&S. 


John  Zavada 
ARO/  18  Aua  86 


ARO  WORKSHOP  CONCLUSIONS 


Doyce  Statterf leld 


Mav  20,  1986 


Dear  Dr.  Agrawal: 

I  was  very  pleased  to  participate  In  the  excellent  ARO-soonsored 
workshop.  You  had  It  well  organized  and  with  many  outstanding  presentations. 
The  Wednesday  afternoon  discussions  were  Informative  thanks  to  vour  taking  the 
time  to  prepare  the  lead-in  view  graphs. 

As  far  as  feedback,  I  have  given  this  considerable  thought  and  from  the 
ARO  perspective  of  balancing  their  basic  research  program  In  computer  science. 
My  general  feeling  Is  that  their  program  Is  well  structured  and  balanced  for 
their  budget.  I  believe  that  we  still  have  a  lot  of  research  to  do  before 
there  will  exist  a  fully  distributed  computing  system.  That  Is,  a  computing 
system  that  has  distributed  processors,  distributed  data  hase,  and  a 
distributed  operating  system,  all  required  to  realize  the  full  potential  of 
distributed  computing. 

One  area  that  appears  to  be  a  bottleneck  for  high  performance  processing 
is  that  of  memory  access  time.  If  there  was  one  area  that  I  would  like  to  see 
more  effort  In,  it  would  be  this.  Another  important  area  Is  that  of  task 
assignment  and  mapping  algorithms  where  the  real  world  problems  are  a  mix  of 
sequential  and  parallel  operations.  I  don't  know  that  we  are  smart  enough  In 
this  area. 

My  problem  really  Is  looking  at  basic  research  and  not  trvlng  to  make  It 
Immediately  solve  an  existing  real-world  problem  —  like  a  scan-to-scan 
algorithm  for  a  passive  sensor  —  but  this  Is  mv  problem. 

Again,  thank  vou  for  considering  me  as  a  participant  for  vour  workshop. 

I  look  forwrd  to  seeing  vou  again. 


COMMENTS  ON  ARO  WORKSHOP  ON  FUTURE  DIRECTIONS 


IN  ARCHITECTURE  AJVD  SOFTWARE,  May  5-7,  1986 

C.A.  Papachristou 


A.  The  general  problem  of  mapping  software  algorithms  or  functions  into  parallel  archi¬ 
tectures  has  not  been  formulated  adequately.  Some  speakers  presented  one  or  two 
interesting  techniques  but  they  were  all  ,:mited  to  special  purpose  systems,  e  g.,  image 
processing.  More  work  needs  to  be  done  in  this  area. 

B.  There  were  several  talks  on  distributed  computer  systems  but  little  was  said  about 
parallelism  in  embedded  computer  systems  such  as  the  ones  incorporated  in  a  large 
platform  (e.g.,  an  aircraft).  There  are  differences  between  distributed  and  embedded 
systems  with  regard  to  capabilities  for  parallel  processing.  Embedded  microsystems 
require  parallelism  at  the  microlevel,  i.e.,  configurable  microarchitectures  that  are 
amenable  to  VLSI  and  sophisticated  microcontrol  designs  to  support  these  structures. 
A  research  program  in  these  areas  is  needed. 

C.  There  is  a  need  of  formal  synthesis  techniques  and  tools  for  multiprocessing  architec¬ 
tures.  There  are  several  tools  with  the  capability  to  automate  or  semiautomate  the 
synthesis  of  processors;  there  are  no  such  techniques  for  multiprocessors.  At  present, 
multiprocessors  are  put  together  basically  by  intuitive  and  empirical  methods  and  then 
the  design  is  evaluated  by  analysis  or  simulation.  With  the  expected  proliferation  of 
parallel  architectures,  there  will  be  a  need  to  formalize  the  synthesis  of  such  machines 
on  the  basis  of  formal  requirements,  specifications  and  design  processes. 

Following  are  ten  areas  which  I  feel  should  be  supported  by  the  ARO  research  pro¬ 
grams  in  Computer  Science  over  the  next  five  years  or  so.  Several  of  these  areas  were  dis¬ 
cussed  during  the  team  leaders  meeting  (I  missed  it  but  1  have  a  copy  of  the  transparen¬ 
cies).  No  particular  order  of  significance  is  assumed. 

1.  Instruction  set  architectures  and  microprogramming 

2.  VLSI  architectures  and  optical  technologies 

3.  Embedded  systems  -  parallelism  at  the  microlevel 

4.  Distributed  systems  (including  Operating  systems) 

5.  Concurrency  and  concurrent  systems  (both  synchronized  and  asynchronous) 

6.  Mapping  algorithms  into  multiprocessor  architectures 
7  Interconnection  networks  and  parallel  architectures 

8.  Synthesis  tools  for  multiprocessor  and  architectures 

9.  Reusable  software  design 

10.  Programming  issues  (including  parallel  and  logic  programming) 


Comments  on 


May  9,  1986 

Kim  Gostelow 


Futurt  Directions  in  Computer  Architecture  and  Software 


I  have  made  my  report  in  the  form  of  the  following  comments: 

1.  The  remark  was  made  that  “software  is  behind  hardware".  To 
correct  this  situation  it  is  necessary  to  encourage  organi¬ 
zations  to  buy/build  hardware  and  experiment  on  it.  i.  e.  . 
work  with  real  systems  and  learn.  That  is.  the  problem  is 
to  some  extent  managerial.  Only  when  large  numbers  of 
organ! zations  have  such  machines  and  do  work  on  them  will 
the  appropriate  directions  be  learned. 

Lef's  get  on  with  it  and  build  the  machines;  then  the 
softuare  uill  catch  up.  Software  certainly  won!'t  catch  up 
without  the  practical  experience  gained  from  working  with 
actual  machines. 

2.  Another  area  involves  the  time  and  expense  of  producing 
large  parallel  systems.  Such  systeus  require  custom  chips 
which  are  as  expensive  and  time-consuming  to  make  as 
softuare.  Combined  with  the  softuare  costs,  the  time  from 
concept  to  system  availability  is  far  too  long  and  far  too 
expensive  to  da  a  great  deal  of  experimentation. 

Same  kind  of  design  tools  are  needed  to  translate  ideas 
into  reality  in  a  much  mare  automatic  fashion  than  is 
currently  possible. 

3.  A  topic  not  mentioned  at  the  workshop  is  packaging.  The 
impact  of  packaging  is  much  more  than  is  commonly  recog¬ 
nized.  For  example,  the  comment  was  made  that  "memory 
speeds  are  falling  behind  that  of  the  cpu".  The  reason  is 
not  that  memory  is  not  fast  enough,  since  memory  is 
currently  available  that  is  even  faster,  in  proportion  to 
cpu  speeds,  than  10  years  ago.  For  example.  23  ns  memory  is 
currently  available  in  64K  bit  chips. 

The  problem  is  packaging.  By  the  time  the  memory  is 
mounted  on  boards  in  sufficient  quantity,  the  effective 
speed  of  the  above  23  ns  chips  is  almost  down  to  100  ns  (in 
round  numbers).  This  can  be  improved  upon  only  by  packag¬ 
ing.  When  such  23  ns  memory  can  be  used  effectively,  then 
the  problem.]  for  multiprocessors  brought  about  by  caches 
(for  example,  the  cache  consistency  problem)  are  solved  — 
simply  eliminate  caches  (one  can  even  pipe  the  memory  and 
the  processors  for  even  greater  resource  utilization). 

4.  The  problem  of  "RISC  verses  CISC",  and  the  problem  of 
"instruction  set  architecture"  are  not  problems  at  all.  I 
agree  uith  the  comment  that  "it  is  a  problem  of  good 
engineering"  and  nothing  more. 
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ch  direction*  of  interest  to  ARO: 


The  hey  aspects  of  cooputer  development  in  the  next  decade< 
and  for  most  Army  applications*  especially  involving  uidely 
distributed  resources  and  personnel!  uill  be. 

1)  The  explosion  in  COMMERCIALLY  available  parallel  electronic 
processors* 

2)  The  rapid  development  of  neu  algorithms,  efficient  on  these 
uidely  available  parallel  machines; 

3)  Improvements  in  parallel  programming  techniques  and  languages; 

4)  Computer  Aided  Design  tools  that  uill  aid  in  rapid  application 

of  new  cooputer  technology*  such  as  VLSI  custom  integrated  circuitry* 
9)  Computing  power  enough  to  finally  realize  much  better  human 
interfaces  especially  high  resolution  color  graphics,  realtime 
speech  generation  and  understanding*  arid  tactile/motion-detecting 
sensors; 

6)  The  increasing  importance  of  communication  and  suitching  based  on 
fiber  optics  that  uill  push  to  becone  very  fast*  full-fledged 
optical  processing  ultMn  10  years. 

In  summary,  the  main  two  uays  to  get  better  computing  uill  be  through 
parallelism  and  though  much  faster  individual  processor  circuitry. 

The  short-term  parallelism  limits  are  about  1*C00  powerful  (1  MIPS  +) 
processors  or  about  100.000  1-bit  processsors  for  aggregates  of 
1  to  10  billion  operations  per  second  (1  -10  BIPS). 

Within  10  years*  the  fastest  parallel  computers  uill  provide  100  to 
1000  BIPS  from  about  100.000  heavy  processors  in  massive  systems 
sharing  many  levels  of  memory  or  about  10* COO.  COO  simple  processors  - 
essentially  very  simple  eye/brain  combinations.  The  neu  algorithm* 
programming*  and  language  issues  uill  be  the  key  to  harnessing 
parallelism  <  although  some  work  uill  be  needed  in  improving 
hardware  switching  and  control  mechanism  to  allou  rapid  access 
to  many  level  of  shared  memory  in  massively  parallel  systems. 

Creating  faster  individual  processors  uill  be  a  matter  of  first 
extending  silicon  VLSI  technology  to  gallium  arsenide  for  the  first 
10  fold  speed  up*  then  slowing  developing  optical  suitching*  memory* 

&  processing  for  a  massive  1*000  to  100*000  fold  improvement  in  speeds 
This  will  be  mainly  very  basic  hardware  research  in  the  next  few  years 
Conpututatlon  speedups  with  optical  computing  uill  com*  both  from 
faster  signalling  and  much  easier  massive  parallelism  with  optical 
distribution  on  data*  but  speeds  in  10  years  probably  uill  be  in  the 
10  to  100  BIPS  range  only  Programming  and  us*  issues  uill  come  more 
slowly  . 

I  hop*  this  helps.  Let  me  knou  if  you  need  more  evaluations. 

Larry  Wittle*  19  Aug  1936 
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Dear  Dharna. 


Thank  you  for  your  letter  of  flay  9  and  the  list  of  topics  on  which  we 
should  comment.  I  an  sorry  for  not  responding  earlier.  (I  have  been 
In  and  out  of  toun  in  flay.  )  Following  are  my  general  comments.  Please  le 
me  know  if  you  uant  me  to  elaborate  on  uhat  is  said  here  or  to  provide 
more  references.  Do  you  want  tho  copy  of  the  transparenc ies  used  in 
the  discussion  session  returned  with  sons  kind  of  markings 

Regards< 

jane 


Although  most  of  the  position  papers  delivered  in  the  Workshop 
on  Future  Directions  in  Software  and  Architecture  are  concerned  with 
hardware  aspects  of  computer  architecture!  the  attendees  whom  I 
Z  had  opportunity  to  talk  with  informally  during  the  Workshop  agreed 
that  the  most  pressing  problems  in  the  future  are  in  softuare 
design  and  development.  Hou  to  write  programs  for  parallel  machines 
and  distributed  systems/  bou  to  maximize  concurrency*  hou  to  increase 
reliability  of  the  overall  systems!  end  hou  to  make  softuare  reusable 
are  examples  of  these  pressing  problems.  But  among  all  these  problems! 
the  most  important  one  of  them/  in  my  opinion/  is  hou  to  bring  together 
hardware  design  and  software  design  and  engineering  disciplines  in  order 
to  find  good  overall  system  design/  analysis  and  synthesis  methods. 

Currently/  computer  hardware  designers  and  software  designers 
almost  always  work  in  insolation  from  each  other,  rather  than  taking 
a  system-design-er iented  approach.  The  majority  of  the  papers  given 
at  the  Workshop  demonstrate  this  fact  quite  well.  This  traditional 
approach  of  partitioning  the  system  design  problem  into  design  of 
machine  instruction  setsi  memory  hierarchy/  interconnection  networks, 
control  structures,  scheduling  and  resource  management  algorithms,  etc 
has  served  us  uell  thus  far.  It  was  neccessary  to  understand  the  design 
Issues  concerning  individual  components  of  any  system  before  one  can 
address  the  design  issues  concerning  the  overall  system.  Many  of  the 
well  adopted  system  design  approaches,  (eg.,  the  layered  approach 
to  design  of  distributed  systems  and  computer  networks. )  encourages 
the  partitioning  of  systems  into  components  and  attacking  the  design 
and  development  of  components  separately. 

In  recent  years,  as  computer  systems  become  more  complex  and 
their  applications  more  critical,  higher  performance,  fault  tolerance 
and  availability  become  more  essential.  Addressing  the  performance 
and  reliability  of  subsystems  independently  of  each  other  often 
results  in  lack  of  overall  efficiency  and  robustness.  It  is  time 
to  develop  top  down,  integrated  approaches  to  system  (software 
and  hardware  together)  design,  fly  paper  on  distributed,  macro 
dataflow  architecture  is  an  attempt  in  this  direction,  in  general. 

The  view  of  the  distributed  system  as  consisting  of  a  collection  of 
servers  which  can  be  invoked  to  perform  different  computation  and 
commun i cat  ion  primitives  discussed  in  my  paper  is  consistent  with 
the  view  pointed  presented  the  in  the  paper  entitled  "An  approach  to 
decentral i zed  computer  system"  by  J.  N.  0. ay.  IEEF  Transactions  on 
Softuare  Engineering.  June.  1986. 


Dharma* 

Below  are  tone  of  my  general  comments  about  recommendations  to 
the  Army  CARO*  and  others)  about  funding  priorities.  This  is  not 
particularly  an  ordered  list*  and  obviously  reflects  my  research 
Interests. 

By  the  way*  thanks  for  your  hard  work  on  the  workshop.  It  generally 
went  well*  despite  the  housing  problems.  a  future  workshop*  however, 
should  probably  attempt  to  select  a  higher  quality  set  of  attendees 
Maybe  a  more  personal  and  strongly  solicitation  uould  have  been 
appropr late? 

thanks  again  for  your  efforts. 

Alan  Smith 


Memory  Hierarchies: 

Extremely  Important  topic.  Highest  funding  priorities  to  CPU 
Cache  Memories  and  Disk  Cache.  The  former  is  important  for  high 
performance  CPUs*  as  are  used  in  many  ueapons*  guidance  systems* 
and  C&C  systems.  The  latter  is  important  to  large  data  processing 
instal latlons  such  as  the  Army  has  it  it*  larger  bases. 

Also  Important  is  optimization  of  memory  systems  in  distributed  systems. 
Memory  interconnection  networks  are  worth  some  study*  especially 
in  conjunction  with  (a)  cache  memories  and  (b>  implementation. 

More  pure  modelling  studies  are  probably  not  called  for  at  this  time. 

Instruction  Set  Considerations 

My  impression  is  that  the  military  computer  architecture  standard* 
which  IVve  actually  never  looked  at.  i*  16-bit.  If  so.  it  is  time 
for  a  32-blt  standard. 

Further  research  on  RISC  vs.  CISC  is  called  for,  with  more  comparisons 
and  sound  scientific  studies. 

The  increased  use  of  custom  chips  as  opposed  to  large  software 
systems  and  standard  processors  should  be  studied. 

There  is  the  need  (as  uc  discussed  at  the  end  of  the  workshop)  for  the 
Army  to  have  a  Computing  Research  Advisory  Board,  to  be  used  to  oversee 
the  award  of  research  grants  and  contracts,  through  ARO  and  other 
organi  zations  <E.  g.  Bill)  Huntsville).  I  can  suggest  appropriate 
people  (including  myself)  if  requested. 

Work  on  reconfiguration  strategies  is  unlikely  to  be  worth  funding 
until  someone  has  a  working,  usable*  programmab le  reconf  igurab  le 
computer.  At  that  point,  some  experiments  and  further  studies 
are  called  for. 

The  Army  (or  DOD)  should  allocate  a  lot  of  money  0*10,000.000)  to  place 
parallel  machines  (e. g  Encore,  Sequent)  in  universities  and  research 
labs,  for  studies  of  parallel  software  development,  granularity  issues, 
etc. 

My  personal  opinion  is  that  small  grain  parallelism  is  a  poor  idea  for 
most  applications  and  uonl't  work. 

Likewise,  I  think  that  multiprocessors  uith  large  numbers  of  processors 
(currently,  more  than  100)  are  a  very  questionable  undertaking  until 
we  learn  hou  to  make  and  use  MPs  uith  small  numbers  of  processors. 

Such  systems  are  likely  to  be  of  use  only  for  certain  specialized 
app 1 lcat ions. 


Mapping  algorithm*  and  task  assignment  -  this  needs  to  be  studied  in  the 
context  of  real  harduare  on  uhich  to  experiment.  <see  above). 

More  abstract  studies  should  be  deemphas i zed  for  the  time  being. 

Reusable  and  retargetable  software  is  an  idea  whose  time  never  seems 
to  come.  Everyone  uants  to  write  software  for  OTHER  people  to  reuse. 

No  one  ever  wants  to  use  existing  software.  Z  would  not  put  money 
into  this. 

Conversely*  I  would  put  money  into  experiments  on  software  productivity 
where  existing  porsonnel  are  given  STATE  OF  THE  ART  software  environments. 

(e.  g  SUN  uor testations,  good  ADA  compilers,  syntax  directed  editors, 
interactive  debuggers,  large  mainframe),  and  are  to  work  on  regular, 
normal,  military  software  projects.  My  suspicion  is  that  the  software 
environments  are  probably  lousy,  and  that  the  Army  is  not  using  existing 
and  known  techniques. 

The  development  and  optimization  of  distributed  operating  systems  is 
important.  One  important  project  would  be  to  fund  development  of 
distributed  (generally  available)  UNIX.  (E.  g.  Berkeley  3.0  BSD). 

What  is  needed  is  a  real  distributed  operating  system  on  which  people 
can  experiment.  (see  also  above  about  miking  parallel  systems  available). 
This  is  a  generally  important  area. 

Currency  control  -  currently  too  many  theoretical  studies,  too  little  actual 
data,  implementation,  and/or  practical  studies. 
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Comments  on 

Future  Directions  in  Computer  Architecture  and  Software 
Mary  Jana  Iruin 

This  write-up  discuss#*  rasaarch  diractions  in  only  four  of 
tha  sassion  topic  araas. 

Two  (custom  chips  and  grainularity  issues) 
ar#  diractly  ralatad  to  our  ARO  supportad  rasaarch  affort 
at  Pan n  Stata  and  two  ara  araas  which  ara  of  interest  to  the 
rasaarchars  at  SRC. 

Custom  Chips 

Rasaarch  has  avolvad  in  tha  past  several  years  from  a  single  chip 
design  to  systems  design*  an  assemblage  of  many  custom  chips  as  well 
as  many  of f-tha-shalf  componants. 

In  systems  design*  not  only  must  the  individual  custom  chips  work 
as  required  but  they  must  also  be  configured  into  a  system  which 
requires  clock  and  data  distribution  across  board(s). 

As  chips  gat  faster  and  systems  more  complex,  both  clock  skew  and 
data  skeu  may  become  problems. 

Thus*  rasaarchars  have  to  learn  to  think  and  look  beyond  tha 
single  chip. 

To  do  this  requires  more  expansive  design  equipment  and  supporting 
personnel. 

CAD  tools  and  equipment  which  support  systems  design  are  largely 
missing  at  Universities,  as  such  tools  are  vary  expensive 
and  not  generally  available  for  free  to  Universities. 

At  least  in  Computer  Science  Department!' s*  supporting  personnel 
are  also  not  usually  available. 

Special  purpose  systems  based  on  unusual  number  representations 
(residue,  logarithmic,  and  signed-digit)  are  now  in  production. 

Memory  Hierarchies  and  Interconnection  Networks 

Research  in  the  areas  of  memory  hierarchies  and  interconnection  networks 
has  been  directed,  in  part,  at  reducing  memory  latency. 

(Interconnection  networks  when  used  to  connect  processors  and  memories 
in  an  MIMD  system;  caches  in  memcry  hierarchies.  > 

Both  of  these  approaches  have  inherent  limits. 

Data  consistency  problems  can  arise  when  using  caches  and  local 
memories. 

As  the  number  of  processors  and  memories  increase,  interconnect i on 
networks  which  are  more  scalable  (e. g.  rooted  networks  like 
H-trees  and  log  n  networks  like  ccc!'s)  are  more  promising 
than  those  that  don’.'t  scale  at  all  (e  g.  <  buses  and  rings). 

Network  hotspots,  combining  networks  (like  Fetch-and-ADD) . 
circuit  versus  packet  switching,  layout  in  VLSI,  and  fault 
tolerance  of  interconnection  netuorks  are  also  important  research 
issues. 

Avoiding  the  memory  latency  problem  uith  new  CPU  arch itectural 
techniques  is  even  more  promising. 

By  issuing  the  next  instruction  to  the  instruction  pipeline  from 
a  different  process  every  cycle*  the  memory  dependency  problem 
can  be  reduced  significantly  or.  removed  altogether. 

This  technique  requires  enough  registers  to  allow  free  (or 
almost  free)  context  switching  every  cycle;  each  process  must 
own  its  oum  register  set  when  'active.  " 
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This  techniques  is  used  in  the  HEP  machines  to  reduce  the 
impact  of  memory  latency. 

How  many  processes  need  to  be  active<  hou  are  the 
register  sets  managed*  uhat  happens  when  the  data  has 
not  yet  arrived  and  the  next  instruction  in  the  process 
is  due  to  be  issued  are  research  issues. 

Orainularity  Issues 

Oralnularlty  can  relate  to  the  physical  size  of  the  processors* 
the  process  size*  or  the  context  switch  cost. 

Processors  which  are  fine  grain  are  attractive  because  of  the  obvious 
VLSI  advantages. 

They  probably  have  bit  or  digit  serial  data  transmission  and  can 
benefit  greatly  in  speed  if  the  operations  can  also  be  pipelined  at 
the  bit  or  digit  level. 

Processes  which  are  fine  grained  (“light  weight”)  can  more  easily 
be  created*  destroyed*  and  exported  in  an  MIHD  evironnent  and 
may  have  advantages  over  the  more  traditional  “heavy  weight"  processes. 
Hore  research  needs  to  be  done  in  this  area. 

Fine  grained  context  switching  allows  one  to  avoid  the  memory 
latency  problem  with  round  robin  instruction  issuing  as  discussed 
in  the  section  above. 

Issues  in  TIIMn  Architectures. 

Both  message  passing  systems  (like  the  iPSC  and  NCUBE) 
and  shared  memory  systems  (like  the  B8N  Butterfly  and 
RP3)  are  now  in  production. 

The  debate  continues  as  to  which  is  the  better  approach  for 
an  MIMC  machine. 

However*  building  the  machine  is  only  one  part  of  the 
prob  len. 

The  design  of  parallel  languages  to 

accommodate  such  architectures  and  the  development 

of  parallel  algorithm,  both 

very  important  research  areas. 

would  be  encouraged  by  making  such 

machines  available  to  the  University  researcher. 

Is  a  hybrid  system  feasible,  one  which  relies  on 
message  passing  for  those  operations  which  work  best  in  that 
environment  and  one  which  relies  on  shared  memory  for 
the  others. 
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List  of  topics  I  feel  are  important  to  pursue  INCLUDES  (in 

random  order): 

RISC  vs  CISC  architecture  -  when  is  each  appropriate;  what 
are  the  tradeoffs. 

parallel  processing  system  memories  -  local  mem.  vs  global 
shared  mem.  vs  hybrid  approaches;  "hot  spots"  in 
shared  memory  accessing;  use  for  synchronization 
primitives;  when  is  the  "fetch  and  add"  primitive 
cost-effective? 

interconnection  networks  for  large-scale  parallel  processing 
-  use  of  optics  for  implementation;  fault  tolerance; 
tradeoffs  among  packet-switching,  circuit-switching 
and  hybrid  approaches;  tradeoffs  between  multistage 
cube  and  hyper  cube  approaches. 

SIMD/MIMD/reconf igurable  systems  -  operating  system  methods 
for  switching  between  SIMD  and  MIMD  modes,  and  for 
performing  other  reconfigurations;  language  features 
to  specify  reconfiguration;  hardware  support  t'or 
efficient  reconfiguration;  mapping  algorithms  to 
reconf igurable  systems. 

mapping  algorithms  and  architectures  -  models  of  algorithms 
and  architectures  and  how  they  interact;  given 
parallel  architecture  -  which  algorithm  approach 
best;  given  algorithm  -  which  architecture  best; 
given  a  reconf igurable  system  -  which 
configuration/algorithm  pair  best. 

parallel  programming  -  what  language  features  needed  for 
efficient  "explicit"  specification  of  parallelism; 
what  features  needed  for  effectively  compilable 
"implicit"  specification  of  parallelism;  portable 
parallel  languages  for  sharing  work;  common  methods 
for  expressing  parallel  algorithms  so  that 
researchers  can  share  and  communicate  results  among 
themselves  more  easily;  tradeoffs  between  the 
efficiency  of  writing  machine  dependent  "explicit" 
specification  of  parallelism  programs  vs  machine 
independent  "implicit"  specification  of  parallelism; 
developing  an  i  documenting  a  set  of  parallel 
programming  hr-  :rniques. 
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